Institution: Zhejiang University
Source: 《世界科技研究与发展》, 2003, Issue 6, pp. 42-50 (9 pages)
Abstract: To find unknown protein-coding genes, annotation pipelines combine ab initio gene prediction with similarity to experimentally confirmed genes or proteins. Here, we show that although ab initio predictions have an intrinsically high false-positive rate, they also have a consistently low false-negative rate. Incorporating similarity information is meant to reduce the false-positive rate, but in doing so it substantially increases the false-negative rate. The crucial variable for prediction accuracy is gene size (including introns): genes of the most extreme sizes, especially very large genes, are the most likely to be incorrectly predicted.