Issues with source code access aside, your description is mostly wrong. These programs take a DNA profile as input- it's just that the DNA profile is mixed (i.e. from multiple people). Reporting no DNA would be nonsensical. Figuring out exactly how many people are in a mixture isn't quite nailed down statistically (last I knew), but it's usually pretty clear for up to 4 or so people.
Yes, you could run different models and get different probabilities. For example, you might compare the likelihood that the sample is a mixture of the suspect, the victim, and some unknown person against the likelihood that it's the victim and two unknown people, or against a model where the victim isn't in the sample at all. However, the specification of those models is part of the trial process.
And the output statistics (at least when being used to determine guilt) are usually quite extreme- likelihood ratios orders of magnitude beyond what 90% or even 99.99% certainty would imply.
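To make that concrete, here's a toy sketch (in Python) of the kind of likelihood-ratio comparison these programs formalize, using the simple "qualitative" mixture model- no peak heights, no dropout. The locus, allele frequencies, and genotypes are all invented for illustration:

    from itertools import combinations_with_replacement, product

    # Hypothetical population allele frequencies at one STR locus.
    FREQS = {"A": 0.10, "B": 0.25, "C": 0.30, "D": 0.35}

    def genotype_prob(gt):
        """Hardy-Weinberg probability of an unordered genotype."""
        a, b = gt
        return FREQS[a] ** 2 if a == b else 2 * FREQS[a] * FREQS[b]

    def likelihood(observed, known_genotypes, n_unknowns):
        """P(seeing exactly this allele set | knowns + n unknowns).

        Enumerates every genotype combination for the unknown
        contributors and sums the probability of those whose alleles,
        together with the knowns', account for exactly what was seen.
        """
        observed = frozenset(observed)
        gts = list(combinations_with_replacement(sorted(FREQS), 2))
        total = 0.0
        for combo in product(gts, repeat=n_unknowns):
            alleles = {a for gt in known_genotypes for a in gt}
            alleles.update(a for gt in combo for a in gt)
            if alleles == observed:
                p = 1.0
                for gt in combo:
                    p *= genotype_prob(gt)
                total += p
        return total

    observed = {"A", "B", "C"}  # alleles seen in the mixed sample
    suspect = ("A", "B")        # suspect's reference genotype

    # Hp: suspect + one unknown contributed.  Hd: two unknowns did.
    lr = likelihood(observed, [suspect], 1) / likelihood(observed, [], 2)
    print(f"Likelihood ratio (Hp/Hd) at this locus: {lr:.1f}")

In real casework this is evaluated at every locus and the per-locus ratios are multiplied together, which is how the reported numbers end up so many orders of magnitude from even odds.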
My point is that the science behind these calculations is well developed- validation studies get published all the time. Whether or not the specific software has errors (or isn't coded exactly as modeled) is an entirely different matter, but it still isn't all that likely. All of these cases rely on expert witnesses anyway- it's not the prosecutor pressing some buttons and printing a report.
There is far more concerning quackery that gets used in forensics- bite marks, hair matching, etc.
Which link various pathogens native to the species consumed to cancer in humans. Apparently the compounds that arise from cooking or processing meat, which also appear in poultry and fish, do not increase cancer risk by as much (or at all) as mammalian meat does- especially beef and pork- despite poultry and fish having the same or even higher concentrations of these compounds.
Depends how you look at it. The fulfillment centers are intentionally built out in the middle of nowhere where land and labour are both plentiful and inexpensive.
If you live in such a place and Amazon is your employer, it may well be the case that you don't have a lot of other options, especially if what you came from was being on social assistance.
So now they will have an even stronger foothold in the depressed areas where they provided people with jobs. EMTs, teachers, etc. are going to start taking jobs at Amazon because it pays better.
>I'm guessing for many people an Amazon warehouse job is stable employment near their small town that might not have many opportunities otherwise.
If they do their research, they'll realize they probably won't have a job for that long [1] and that it would be preferable to stay in a reliable position (one that, with your examples, likely pays better).
Land may be plentiful in the middle of nowhere, but labor wouldn't be. The middle of nowhere is usually sparsely populated- that's why it's the middle of nowhere.
True, but given the pretty hard cap on commute speed, I don't think it changes much. It's not unheard of to commute 50 miles to work in the Bay Area now (that's basically San Jose to San Francisco), so how much more can you do in the middle of nowhere- double that? I don't think many people would drive 3 hours there and 3 hours back every day.
Minor nitpicking: the 23andMe kits use microarrays for genotyping, not DNA sequencing. It's an older and more limited technology but it is much cheaper than sequencing.
To add to your points, a very important consideration is that charter schools can game the metrics by forcing out students who aren't performing well, while public schools can't turn anyone away. This may be why they seem to improve test scores without making any actual difference [1].
Public schools can use charters to game metrics themselves too [2].
Plus a special sauce for counting the number of specific bp repeats due to in-del events. This is not something I am too familiar with, but presumably the number of repeats of a specific k-mer in these genes of interest might correlate with a specific type of cancer? (I would love to hear the opinion of someone who is an expert in this field.)
"Copy number variant" refers to larger deletions and duplications that can occur in the genome. There isn't some specific cutoff for size, but some examples in these kinds of genes would be an entire exon or gene. There are countless studies that find correlations between specific variants or CNVs and risk of cancers.
Standard variant detection is pretty straightforward. CNVs are harder because they are longer (several hundred to several thousand base pairs) than the raw data (150 to 250 bp for Illumina)- you don't get single reads that span the entire variant. You have to normalize then look for differences in coverage, or look for split reads (where the read is aligned on the border of one of these CNVs).
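To illustrate the coverage approach, a minimal sketch with invented per-window read counts (a real pipeline would also correct for GC content and mappability, and segment windows instead of calling each one independently):

    # Hypothetical read counts in consecutive fixed-size windows,
    # against a matched baseline (e.g. a panel of normal samples).
    sample_depth   = [102, 98, 210, 205, 199, 101, 95, 4, 3, 100]
    baseline_depth = [100] * 10

    for i, (s, b) in enumerate(zip(sample_depth, baseline_depth)):
        ratio = s / b              # ~1.0 for the normal two copies
        copies = round(2 * ratio)  # naive diploid copy-number estimate
        if copies > 2:
            call = f"gain ({copies} copies)"
        elif copies < 2:
            call = f"loss ({copies} copies)"
        else:
            call = "normal"
        print(f"window {i}: depth ratio {ratio:.2f} -> {call}")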
This kind of funding baffles me because they don't seem to be proposing anything new at all (maybe slightly better CNV detection?) and there are already lots of labs/companies doing this kind of testing. Maybe they are working on being very efficient to offer a better price.
According to Sanger (or maybe TCGA?), a gain is when a genomic region (for a diploid) has more than five absolute copies and a loss is when the genomic region has no reads (http://cancer.sanger.ac.uk/cosmic/help/cnv/overview). The copy number is perhaps determined by that normalized distribution of read coverage across the reference genome?
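If I'm reading that page right, the labels would reduce to something like this tiny sketch (thresholds as described above, assuming a diploid genome; the copy-number estimate itself would come from that normalized coverage):

    def cosmic_style_label(total_copies):
        # Thresholds per the COSMIC description above (diploid case).
        if total_copies > 5:
            return "gain"
        if total_copies == 0:
            return "loss"
        return "no call"

    for cn in (0, 2, 6, 8):
        print(cn, "->", cosmic_style_label(cn))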
Then for small CNVs that perhaps span that 150-200 bp fragment, we use the split-read method to filter for incompletely mapped reads that align to the reference only at their edges. This implies that there was a duplication event that expanded that sequence? (Fig 1b, Split Read method).
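For what it's worth, a rough sketch of that split-read screen using pysam- soft-clipped read edges piling up at the same coordinate are candidate breakpoints. The BAM path and region here are hypothetical, and the BAM is assumed to be coordinate-sorted and indexed:

    from collections import Counter
    import pysam

    SOFT_CLIP = 4   # BAM CIGAR op code for soft clipping
    MIN_CLIP = 20   # ignore tiny clips, which are usually noise

    breakpoints = Counter()
    with pysam.AlignmentFile("patient.bam", "rb") as bam:
        for read in bam.fetch("chr17", 41_190_000, 41_280_000):
            if read.is_unmapped or read.cigartuples is None:
                continue
            first_op, first_len = read.cigartuples[0]
            last_op, last_len = read.cigartuples[-1]
            if first_op == SOFT_CLIP and first_len >= MIN_CLIP:
                breakpoints[read.reference_start] += 1  # clipped left edge
            if last_op == SOFT_CLIP and last_len >= MIN_CLIP:
                breakpoints[read.reference_end] += 1    # clipped right edge

    # Positions supported by many clipped reads are candidate CNV edges.
    for pos, count in breakpoints.most_common(5):
        print(f"candidate breakpoint at {pos}: {count} clipped reads")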
Presumably, the pipeline would determine the CNV sites in a specific patient sample, then cross-reference with the TCGA CNV data-set and come up with a correlation score of how well those CNV sites match the consensus CNVs in the cancer data-set? Thanks again for your detailed breakdown.
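Maybe the matching step could be as simple as a reciprocal-overlap score? A toy sketch with made-up intervals (a real pipeline would also match on chromosome and on the direction of the copy change):

    def reciprocal_overlap(a, b):
        """Fraction of overlap relative to the longer interval."""
        start, end = max(a[0], b[0]), min(a[1], b[1])
        if end <= start:
            return 0.0
        return (end - start) / max(a[1] - a[0], b[1] - b[0])

    patient_cnvs = [(41_196_000, 41_278_000), (55_200_000, 55_240_000)]
    dataset_cnvs = [(41_190_000, 41_280_000), (10_000_000, 10_050_000)]

    for cnv in patient_cnvs:
        best = max(reciprocal_overlap(cnv, ref) for ref in dataset_cnvs)
        print(f"patient CNV {cnv}: best overlap with dataset = {best:.2f}")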
The Sanger/TCGA (The Cancer Genome Atlas) stuff seems to be specific to microarray data which is different (older, more expensive) than the newer high-throughput data.
The figure you linked is a good explanation. The split-read method is helpful for finding the edges of the CNV, while the number of reads (relative to other regions that were tested) can give an idea of the number of copies. The problem is that these methods all have their own unique biases/noise that make it non-trivial to figure out the absolute copy number change.
Ideally they would find a similar CNV that has some clinical association.
Thanks jrm5100 for the link. I see the variants under the "DGV Structural Variants" track. Really appreciate your explaining what CNVs are and also following up on my questions/confusions!