Fascinating! AlphaFold (and other competitors) seem to use MSA (Multiple Sequenc...

flobosg · on Nov 30, 2020

> do MSA-based approaches also help understand "first-principles" folding physics any better?

Not really. MSA-based approaches, like most structure prediction methods, aim to find the lowest-energy conformation of the protein chain; they disregard folding kinetics and basically all dynamic aspects of protein structure.

> If I write a random genetic sequence (think drug discovery) that has many aligned sequences, without the strong assumption of co-evolution at my disposal, there does not seem any good reason for the aligned sequences to also be proximal.

I don't think I fully understood this, but I'll give it a shot anyway. If your artificial sequence aligns with others, there's a chance that it will fold like them, depending on the quality and accuracy of the multiple sequence alignment. Since multiple sequence alignments are built under the assumption of homology (all sequences have a common ancestor), it's a matter of how far from the "sequence sampling space" your sequence is located compared to the others.

heycosmo · on Nov 30, 2020

> I don't think I fully understood this, but I'll give it a shot anyway. If your artificial sequence aligns with others, there's a chance that it will fold like them, depending on the quality and accuracy of the multiple sequence alignment. Since multiple sequence alignments are built under the assumption of homology (all sequences have a common ancestor), it's a matter of how far from the "sequence sampling space" your sequence is located compared to the others.

I understand that similar sequences may fold similarly (although as length increases, I highly doubt it, but IDK). I'm talking about aligned sub-sequences within one chain and their ultimate distance from each other in the final structure. Co-evolution suggests that aligned sub-sequences are also proximal. But manufactured chains did not evolve, therefore the assumption is no longer useful.

flobosg · on Nov 30, 2020

Oh, I see! Yes, an intrachain alignment of an artificial sequence does not by itself give any information about co-evolution, especially since you don't know whether your protein is actually folding. To assess co-evolution you need a multiple sequence alignment between protein homologs containing correlated mutations.
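For illustration only: one simple, widely used way to quantify "correlated mutations" between two columns of an MSA is mutual information. Nothing in this thread says which measure any CASP team uses, so treat this as a sketch. The symbols are assumptions introduced here: $f_i(a)$ is the fraction of sequences with residue $a$ in column $i$, and $f_{ij}(a,b)$ is the fraction with residue $a$ in column $i$ and residue $b$ in column $j$.

```latex
\mathrm{MI}(i,j) \;=\; \sum_{a,b} f_{ij}(a,b)\,
  \log \frac{f_{ij}(a,b)}{f_i(a)\, f_j(b)}
```

If the two columns vary independently, $f_{ij}(a,b) = f_i(a) f_j(b)$ and the score is zero; it grows as substitutions in one column become predictive of substitutions in the other. A single artificial sequence has no column frequencies at all, which is why a set of homologs is needed.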

> I understand that similar sequences may fold similarly (although as length increases, I highly doubt it, but IDK).

As long as sequence similarity is maintained across those sequences, length is not an issue.

> Co-evolution suggests that aligned sub-sequences are also proximal

What do you mean by "proximal"? Close in space, or similar in structure?

heycosmo · on Dec 2, 2020

> To assess co-evolution you need a multiple sequence alignment between protein homologs containing correlated mutations.

That makes sense. So in the CASP competition, when teams are given a sequence, do their algorithms do something like the following?

1. Search database for homologs of given sequence
2. Look at MSA and correlated mutations of homologs
3. Look for similar correlated mutations in given sequence

I imagine 1-3 could somehow be embedded in a NN after training on a protein database.
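As a rough illustration of what steps 1-3 above could look like in their simplest hand-written form, here is a toy Python sketch. Everything in it is an assumption made for the example: the `DATABASE` of pre-aligned sequences, the 50% identity cutoff, the function names, and the use of mutual information as the covariation score. It also reads step 3 as "map the strongest co-varying column pairs back onto the query as candidate contacts". Real pipelines use proper homology search tools and far more sophisticated statistics, and this is not a description of any CASP entry.

```python
from collections import Counter
from itertools import combinations
from math import log

# Hypothetical toy data: sequences are assumed to be already aligned
# to the query (same length, no gaps).
QUERY = "MKLAEV"
DATABASE = ["MRLADV", "MRLGDV", "MKLGEV", "MRIADV", "WPQHNC"]


def identity(a, b):
    """Fraction of aligned positions where two sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def find_homologs(query, database, min_identity=0.5):
    """Step 1 (toy version): keep sequences similar enough to the query."""
    return [s for s in database if identity(query, s) >= min_identity]


def mutual_information(msa, i, j):
    """Step 2 (toy version): covariation score between columns i and j."""
    n = len(msa)
    fi = Counter(s[i] for s in msa)
    fj = Counter(s[j] for s in msa)
    fij = Counter((s[i], s[j]) for s in msa)
    # c/n is the joint frequency; c*n/(fi*fj) is joint / (marginal product).
    return sum((c / n) * log(c * n / (fi[a] * fj[b]))
               for (a, b), c in fij.items())


def predict_contacts(query, database, top_k=1):
    """Step 3 (toy version): report the top co-varying position pairs."""
    msa = [query] + find_homologs(query, database)
    scores = {(i, j): mutual_information(msa, i, j)
              for i, j in combinations(range(len(query)), 2)}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


print(predict_contacts(QUERY, DATABASE))
```

In this toy data, positions 1 and 4 (0-based) always change together (K with E, R with D), so that pair gets the highest score and would be the candidate for "close in space". The unrelated sequence `WPQHNC` is dropped in step 1 and never enters the alignment.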

> What do you mean by "proximal"? Close in space, or similar in structure?

I mean close in space.

ashtonbaker · on Nov 30, 2020

This is a really insightful question and I need to take some time to fully understand the ensuing discussion.

If my speculation is correct, then drug discovery should use a process of genetic programming, using something like this to score the resulting amino acid sequences. I'm wondering if an artificial process of evolution would be sufficient to satisfy the co-evolution assumption here.

flobosg · on Nov 30, 2020

> I'm wondering if an artificial process of evolution would be sufficient to satisfy the co-evolution assumption here.

In principle yes, if you can generate a significant number of artificially evolved variants that are folded/functional.