
2 sequences (236T/G, 240A/G and 561T/G) or 5’ of the Amoebapore C transcript XM_650937.2 (407A/C and 422A) seemed to be present only in the two to four Bangladesh isolates sequenced by Bhattacharya et al. and were not present in the available international sequenced whole genomes [36]. The goal of this work was to develop a set of less variable markers to profile a large number of strains from different regions of the globe; therefore, we selected additional non-synonymous SNPs which Bhattacharya et al. had shown to be less variable, to probe the population structure of E. histolytica in depth [36]. The new SNPs were present at a frequency of 0.3-0.6 in the pool of geographically disparate E. histolytica parasites whose genomes had been sequenced. We restricted our SNP candidates for initial analysis to genes with the potential to be involved in the virulence of this parasite [8–17]. Because our current hypothesis is that the development of disease is multifactorial, or polygenic, and involves a combination of parasite factors, in the current work we selected several loci to test for their association with disease outcome in E. histolytica. These loci contained SNPs that resulted in non-synonymous changes to the encoded amino acids, were present in more than three of the sequenced E. histolytica genomes, and were enriched in strains originating from either symptomatic or asymptomatic infections. We have shown that two of these SNPs were significantly associated with disease severity in Bangladesh isolates.

Results

Initial identification and validation of single nucleotide polymorphisms identified using Next Generation Sequencing

The genome sequencing projects of multiple E. histolytica strains performed at the J. Craig Venter Institute (JCVI) and at the Institute of Integrative Biology (University of Liverpool) provided the sequence data used only for the identification of SNPs (Table 1) [35]. A total of 10,855 SNPs within coding DNA were identified in the sequenced genomes (Additional file 1: Table S1). Each strain had approximately 1,500 homozygous and 1,000 heterozygous SNPs. Half of all the SNPs identified were unique and present in only one strain (“private” SNPs). Like Ghosh et al., we identified mainly dimorphic SNPs, while potential tri- and tetrazygote variants were very infrequent [22]. This, however, may reflect a bias in SNP detection programs, because Mukherjee et al. observed considerable heterogeneity in the ploidy of E. histolytica [38].

Table 1 Genomes sequenced by the Genomic Sequencing Center for Infectious Diseases (GSCID) and the Institute of Integrative Biology, E. histolytica Genome sequencing projects

Strain id    Genbank identifier if available    Source/reference
GSCID E. histolytica Genome Sequencing Project
MS96-3382    885314                             R. Haque, unpublished data ICDDR,B
DS4-868      885310                             Ali et al. 2007 [24]
KU 27        885311                             Escueta-de Cadiz et al. 2010 [29]
KU 50        885313                             Escueta-de Cadiz et al. 2010 [29]
KU 48        885312                             Escueta-de Cadiz et al.

