A major challenge in population genetics is the inference of natural positive selection. Recent progress in the development of new statistics, together with unprecedented amounts of data, holds promise for the future. We have implemented a large number of tests for natural selection in a bioinformatic framework, including Fst, CLR, XP-CLR, omega, Tajima's D, Fay & Wu's H, Fu & Li's D, iHS, diHH, XP-EHH and dDAF, among others. We used extensive coalescent simulations of neutral and selected genomic regions to evaluate each statistic for (i) its sensitivity in detecting selection and (ii) its robustness under diverse demographic scenarios. By combining the statistics into a single composite score through machine learning, we both improved sensitivity and made interpretation easier compared with the individual tests for selection. We have applied our methods to experimental data from the 1000 Genomes Project and will make the results accessible in a public genome browser. Some tracks are already available here.
Figure: The q-arm of chromosome 15 with diverse tests for positive selection (green; -log of empirically ranked scores), including the novel composite score (blue). Note the highest peak around position 48,000 kb (near the skin pigmentation gene SLC24A5) and the corresponding signals of selection in many, but not all, of the different statistics (from: Pybus, Dall’Olio, Luisi, Uzkudun, Carreño-Torres, Pavlidis, Laayouni, Bertranpetit, Engelken; unpublished).
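
The "-log of empirically ranked scores" shown in the figure can be illustrated with a minimal sketch. This is not the authors' pipeline; it only shows one common way to turn raw per-window scores into rank-based values. Several details are assumptions: that larger raw scores indicate stronger evidence of selection, that the logarithm is base 10, and that the empirical rank is scaled as rank / (n + 1). The function name `empirical_neg_log_scores` is hypothetical.

```python
import math


def empirical_neg_log_scores(scores):
    """Return -log10 of the empirical upper-tail rank of each score.

    Assumptions (not taken from the source): higher raw scores mean
    stronger evidence of selection, and the log is base 10.
    """
    n = len(scores)
    # Indices ordered from the highest raw score to the lowest.
    # Ties are not averaged; tied scores keep their original order.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    result = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) keeps the empirical value strictly between 0 and 1,
        # so the logarithm is always defined.
        result[idx] = -math.log10(rank / (n + 1))
    return result


# Toy per-window scores for a single statistic (made-up numbers).
window_scores = [0.2, 3.1, 1.0, 0.5]
ranked = empirical_neg_log_scores(window_scores)
peak_window = max(range(len(ranked)), key=lambda i: ranked[i])
```

With this transform, every statistic ends up on the same scale regardless of its original units, which is what allows tracks for different tests to be compared visually along a chromosome.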