What’s all the fuss about base composition?

Every genome is made of only four nucleotides: adenine, thymine, guanine and cytosine. By convention, biologists describe the base composition of genomes in terms of GC content, or GC%.

$$ GC\% = \frac{GC}{AT + GC} $$
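As a minimal sketch of this formula, GC content can be computed directly from a sequence string. The function name `gc_content` and the choice to ignore ambiguous symbols such as `N` are assumptions made for illustration, not a standard.

```python
def gc_content(sequence: str) -> float:
    """Return the GC fraction (between 0 and 1) of a DNA sequence.

    Symbols other than A, C, G and T (for example N) are ignored,
    which is an assumption of this sketch.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return gc / (gc + at)


# Toy sequence with 5 G/C and 5 A/T bases
print(gc_content("ATGCGCGTTA"))  # 0.5
```

Multiplying the result by 100 gives GC% as a percentage.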

Base composition is a fundamental genome trait. It typically varies severalfold, both within a genome and between genomes. The diversity of microbial GC content has been described for some fifty years[1]. Now that genomes can be sequenced on a massive scale, we know that GC% is even more variable than expected. The extremely reduced genomes of insect endosymbionts typically have low GC%: Carsonella ruddii has one of the most AT-rich genomes sequenced to date. At the other end of the distribution, the large genomes of soil bacteria such as Streptomyces have a GC content of around 70%.

The origin of this variability has been debated for fifty years. Some authors have proposed that GC% is under selection, that is, that GC% contributes to individual fitness in a given environment. Several lines of evidence suggest that environmental parameters such as oxygen or nitrogen availability, or optimal growth temperature[2], may be related to GC%. But correlations between such factors and base composition are weak at best, and often do not hold up when analysed in a rigorous phylogenetic framework, i.e. one that takes phylogenetic inertia into account. Thus, the favoured hypothesis since the famous work of Sueoka in 1962 was that divergence in mutational bias is the main determinant of base composition variability. Sueoka proposed that GC-rich genomes experience more mutations toward G and C, while AT-rich genomes experience more mutations toward A and T. This hypothesis popularized the idea that base composition is mostly determined by a neutral process: mutation.
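The mutational bias idea can be illustrated with a two-state model, which is a simplification assumed here rather than taken from the original work: each site is either GC or AT, `u` is the GC to AT mutation rate and `v` is the AT to GC mutation rate. At equilibrium the two fluxes balance, GC × u = (1 − GC) × v, which gives GC = v / (u + v). The rates in the example are made up.

```python
def equilibrium_gc(u: float, v: float) -> float:
    """Equilibrium GC fraction under mutation alone (two-state sketch).

    u: rate of GC -> AT mutation (hypothetical value)
    v: rate of AT -> GC mutation (hypothetical value)
    """
    if u < 0 or v < 0 or u + v == 0:
        raise ValueError("rates must be non-negative and not both zero")
    return v / (u + v)


# Mutation biased toward GC gives a GC-rich equilibrium...
print(equilibrium_gc(u=1.0, v=3.0))            # 0.75
# ...and mutation biased toward AT gives an AT-rich one.
print(round(equilibrium_gc(u=2.0, v=1.0), 3))  # 0.333
```

Under this view, a GC-rich genome simply reflects a ratio v / u greater than one.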

2010: Mutational bias hypothesis shattered

Sueoka’s idea was the dominant school of thought for fifty years. It was shattered by a single issue of PLoS Genetics[3], in which two well-known teams of geneticists independently and simultaneously demonstrated that mutation is universally biased toward AT, even in GC-rich genomes. They also showed that some evolutionary force tends to increase GC content in all genomes. The consequences of these two papers may well represent a “seismic shift of paradigm” in microbial evolution, as Rocha & Feil[4] put it in their review paper. It can only mean one thing: some selection or selection-like process is increasing GC content in bacteria.

Why not GC-biased gene conversion?

In eukaryotes, recombination has been shown to affect GC content. Recombination involves the pairing of homologous strands of DNA into a structure called a heteroduplex. Mismatches between the strands are corrected by a dedicated repair mechanism, which tends to introduce more GC bases than AT bases. Mismatch repair leads to gene conversion, i.e. the unidirectional transfer of genetic information from the donor strand to the corrected strand. In mammals, and probably in many other eukaryotes, gene conversion is biased toward GC, which means that GC:AT mismatches are more often corrected to GC than to AT. The consequences of GC-biased gene conversion (gBGC) have been thoroughly reviewed by Duret & Galtier in their 2009 paper. The major consequence is that recombination hotspots (intensely recombining genomic regions) are GC-richer than coldspots. There is a very strong correlation between the GC content of a genomic region and its crossover rate[5].

By increasing the fixation probability of GC alleles over AT alleles, gBGC leaves a selection-like signature in genomes. When the two 2010 papers came out, my supervisor thought that gene conversion might also be biased toward GC in bacteria! During his PhD, Florent Lassalle therefore looked for traces of recombination in bacterial genomes, using comparative genomic tools[6]. He binned genomes into twenty distinct GC% bins, and found that the proportion of recombining genes in a given bin is correlated with its GC%. He also found that GC% is higher in recombining genes, and even higher at third codon positions. Because of the redundancy of the genetic code, the third codon position is less strongly subject to purifying selection.
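To see how a fixation bias can outweigh an AT-biased mutation pattern, here is a rough sketch based on a standard two-allele population-genetic approximation, which is an assumption of this example and not a result from the papers above. Let `u` be the GC to AT mutation rate, `v` the AT to GC mutation rate, and `B` a population-scaled bias in favour of GC alleles (whether it comes from gBGC or from selection). If the relative fixation probability of a favoured allele is B / (1 − e^(−B)), the equilibrium GC fraction is 1 / (1 + (u / v) e^(−B)). All numerical values are hypothetical.

```python
import math


def equilibrium_gc_with_bias(u: float, v: float, B: float) -> float:
    """Equilibrium GC fraction with mutation and a fixation bias toward GC.

    u: GC -> AT mutation rate (hypothetical value)
    v: AT -> GC mutation rate (hypothetical value, must be > 0)
    B: population-scaled fixation bias toward GC; B = 0 means no bias
    """
    if u < 0 or v <= 0:
        raise ValueError("u must be non-negative and v must be positive")
    return 1.0 / (1.0 + (u / v) * math.exp(-B))


# Mutation is twice as likely to go toward AT (u / v = 2),
# yet GC content rises as the fixation bias B grows.
for B in (0.0, 1.0, 3.0):
    print(B, round(equilibrium_gc_with_bias(u=2.0, v=1.0, B=B), 3))
```

With B = 0 the formula reduces to v / (u + v), the mutation-only expectation, so any GC content above that value points to a fixation bias of some kind.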

All these observations are compatible with the gBGC hypothesis.

What’s my job?

But observations are only that: observations. It would be great if we could test experimentally whether or not gene conversion is biased in bacterial models. My first goal is thus to measure how often GC:AT mismatches are converted toward GC and toward AT. I will force mismatches by using synthetic sequences that introduce single nucleotide variants at a given locus, and then sequence the recombination products. This requires many recombining bacteria, so we chose to use natural transformation in Acinetobacter baylyi, which is known for its high transformation frequencies.
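One possible way to analyse such data, given here only as a sketch and not as the analysis actually planned, is an exact two-sided binomial test. It assumes that each sequenced recombinant derived from a GC:AT mismatch is scored as resolved toward GC or toward AT, and that unbiased conversion would give each outcome a probability of 0.5. The function name and the counts are hypothetical.

```python
from math import comb
from typing import Tuple


def conversion_bias_test(n_gc: int, n_at: int) -> Tuple[float, float]:
    """Return (observed GC fraction, two-sided exact binomial p-value).

    n_gc: recombinants in which the mismatch was resolved toward GC
    n_at: recombinants in which the mismatch was resolved toward AT
    Null hypothesis: conversion is unbiased (probability 0.5 for each).
    """
    n = n_gc + n_at
    if n_gc < 0 or n_at < 0 or n == 0:
        raise ValueError("counts must be non-negative and not both zero")
    observed_weight = comb(n, n_gc)
    # Under p = 0.5 every outcome k has probability comb(n, k) / 2**n,
    # so sum the outcomes that are no more likely than the observed one.
    extreme = sum(comb(n, k) for k in range(n + 1)
                  if comb(n, k) <= observed_weight)
    return n_gc / n, extreme / 2 ** n


# Made-up counts: 60 recombinants resolved toward GC, 40 toward AT
fraction_gc, p_value = conversion_bias_test(60, 40)
print(fraction_gc, p_value)
```

A small p-value together with a GC fraction above 0.5 would be consistent with GC-biased conversion, although a real analysis would also have to account for sequencing errors and for differences between loci.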

My second goal is to estimate fundamental recombination-related traits, such as the recombination rate per base and the global effective population size. We want to compare these measurements with environmental traits known to affect GC content, and with fundamental genome properties such as genome size. The idea is to check whether recombination is a confounding factor behind the traits previously proposed as determinants of GC%.

Maybe we’ll then have a more solid overview of what may be at stake in this ever-elusive property of bacterial genomes…


  1. See for example the work of Hill et al, 1966.
  2. Optimal growth temperature (OGT) was thought to be positively correlated with GC% in the nineties, but this theory has since been refuted. See the work of Galtier et al, 1997. They demonstrated that only the GC% of DNA encoding structural RNA is correlated with OGT. There is no significant correlation between genomic GC content and OGT.
  3. See this work by Hildebrand et al, and this work by Hershberg & Petrov.
  4. See here.
  5. See this work by Duret & Arndt on the impact of recombination on nucleotide substitutions in the human genome.
  6. His paper was published in PLoS Genetics in 2015.