The Tree of Life: Adaptation and Convergence in Regulatory Systems, guest post by David Pollock

Below is another in my series on "The Story Behind the Paper" with guest posts from authors of open access papers of relevance to this blog. Today's post comes from David Pollock in the Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine. Note - I went to graduate school with David and he is a longtime friend. This is why he apparently feels okay to call me Jon even though I go by Jonathan. I have modified his text only to format it. Plus I added some figures from his paper.

Adaptation and Convergence in Regulatory Systems

Guest post by David Pollock

Explanation

This is a guest blog post about a paper of mine just published in Genome Biology and Evolution, "SP transcription factor paralogs and DNA binding sites coevolve and adaptively converge in mammals and birds", and it amounts to my first attempt at an un-press release. You can tell it's not a press release because I'm writing this about a week after the paper came out. This guest blog is coming about because I released an announcement about the paper to a few friends, and credited my old buddy Jon Eisen with having inspired me to move towards avoiding press releases. And I sure wouldn't want this paper to be named in the same breath as the now infamous arsenic bacteria and Encode project press releases (although I note that a recent paper from my lab estimates that the human genome is at least 2/3 derived from repetitive sequence); also see posts by Michael Eisen and by Larry Moran. I had the vague idea that maybe Jon, Ed Yong, Carl Zimmer, or some other famous blogger would be so inspired by the move (and the fundamental coolness of the paper) that they would quickly write up a thoughtful interpretive summary for general audiences. Jon, however, being much too smart for my own good, suggested that I should write it up as a guest blog instead. So, here I am. At least I got Jon to agree to edit it.

The fundamentals of the paper

Okay then, why do I think adaptation and convergence in regulatory systems is cool and important? Well, first because I think a lot of important changes in evolution have to have come about through regulatory evolution, and yet there are huge gaps in our knowledge of how this change might happen. And I say this as someone who has spent most of his career studying (and still believing in the importance of) sequence evolution. Second, a lot of people seem to think that evolution of whole regulatory systems should be hard, because there are so many interactions that would need to change at the same time. Remember, transcription factors can interact with hundreds of different binding sites to regulate hundreds of different proteins. It makes sense that evolution of such a system should be hard. In this paper, I think we go a long way towards demonstrating that this intuitive sense is wrong, that functional evolution of regulatory systems can happen quite easily, that it has happened in a system with around a thousand functional binding sites, and that some of the details of how it happens are really interesting.

Fig. 1. Evolution of SP transcription factors. (A) SP1 binds preferentially to the GC box in placental mammals and birds (red) and to the ancestral GA box consensus in other vertebrates (black). Modifications in binding motif preferences along the phylogeny are denoted by red-filled circles. ‘Variable regions’ in zinc finger 2 (zf2-VR), containing all non-conserved sites in zinc finger 2 within vertebrates, are shown for SP1, SP3, and SP4. Site -13 (highlighted) is putatively responsible for the change in SP1 binding preferences. (B) Zinc finger 2 (zf2) of human SP1, SP3, and SP4. Each zinc finger contains an alpha-helix and two beta sheets (Philipsen et al. 1999; Dhanasekaran et al. 2006). Red and gray columns denote sites non-conserved across vertebrates; all are contained in the boxed variable region (zf2-VR), comprising sites -13 to -8. Site +3 binds directly to the convergent A/C fourth site of the GC box. (C) SP1 binds to the DNA via zinc fingers 1-3 (zf1-zf3), where zf2 binds to the three central nucleotides of the GC box (GGGCGG) (Philipsen et al. 1999; Bouwman et al. 2002; Dhanasekaran et al. 2006). Site -13 (red) is only 9.5 Å from site +3 (green) and directly contacts the neighboring site (site +4) (Bouwman et al. 2002; Oka et al. 2004; Dhanasekaran et al. 2006).

Aside from promoting the science, the other reason I want to blog about this paper is that I think it is a great demonstration of how fun and how diverse science can be. To support its points, it brings in many different types of evidence, from genomics to population genetics to protein structure prediction. It is also a good example of using only publicly available data, plus a lot of novel analysis, to see something interesting that was just sitting there. I think there must be a lot more stories like that. All of that Encode data, for example, is bound to have some interesting undiscovered stories, even with 30 papers already published. The most fun part, though, which I don't think I can fully recreate without jumping up and down in front of you, was just the thrill of discovery, and the thrill of having so many predictions fall into place with data from so many different sources.

I don't recall another project where we would say so many times, "well, if that is the explanation, then let's look at this other thing," and bam!, we look at the other thing and it fits in too. It started with Ken Yokoyama, the first author, walking into the lab having just published some pretty good evidence that the preference for the SP1-associated binding site (the GC-box) was newly evolved in the ancestor to eutherian mammals. Well, if that's true, there ought to be a change in the SP1 protein sequence that can explain it. Sure enough, there is, and SP1 is a very conserved protein that doesn't change a lot. Hmm, we have more sequences now, let's look to see if preferences changed anywhere else on the phylogenetic tree. Yes, in birds. Well, there ought to be a change in bird SP1 that can explain that; sure enough, there is, and it's at the homologous position in the protein. Looking good, but is it in the right place in the protein? Yes, in the right domain (zinc finger 2, or zf2), right behind the alpha helix that binds the nucleotide for which the preference changed. And before you ask, Ken ran a protein structure prediction algorithm on the amino acid replacements in SP1, and the predicted functional replacements in bird and mammal are predicted to bend the protein right at the point where it binds the nucleotide at which the preference changed. You might then ask if this amino acid replacement does anything to the binding function, and the answer again is "yes". This time, though, we were able to rely on existing functional studies, which showed that human SP1 binds 3x better to the GC-box binding site than it does to the ancestral GA-box (more on this below).

Fig. 2. Birth-death rates of the SP1 binding motif in mammals. Birth rates (α) denote the probability (per year) that an unoccupied position will gain a binding site; death rates (β) give the probability (per year) that an existing binding site is lost. Branches in the mammalian phylogeny were partitioned into three groups: early eutherian mammals (red), late eutherian mammals (black), and GA box-preferring non-eutherian mammals (blue). Birth and death rates of each group were estimated for the GC box (GGGCGG), GA box (GGGAGG), and the non-functional motif GGGTGG (Letovsky et al. 1989; Wierstra 2008).
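To make the birth and death rates in Fig. 2 a little more concrete, here is a minimal sketch of how a constant birth rate α and death rate β translate into the expected fraction of positions carrying a motif after some number of years. It treats each position as a simple two-state (occupied or unoccupied) continuous-time process, which is an assumption made for illustration and not necessarily the estimation procedure used in the paper; the function name and the example rates are hypothetical.

```python
import math


def occupancy(start_fraction, birth_rate, death_rate, years):
    """Expected fraction of occupied positions after `years`.

    Assumes a two-state process with constant per-year rates:
    unoccupied -> occupied at `birth_rate` (alpha),
    occupied -> unoccupied at `death_rate` (beta).
    """
    total_rate = birth_rate + death_rate
    if total_rate == 0:
        return start_fraction  # nothing ever changes
    equilibrium = birth_rate / total_rate
    # Exponential relaxation from the starting fraction to equilibrium.
    return equilibrium + (start_fraction - equilibrium) * math.exp(-total_rate * years)
```

Under this sketch, a burst of births on a branch corresponds to a temporarily elevated α, while a slow fade of the older motif corresponds to a modest β acting over a long time.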

The coup de grace on this residue position as the source of convergent functional changes, though, came with consideration of the other transcription factors that interact in this regulatory system (that is, they bind to the same binding sites to modify transcription). If there was a functional change in the transcription factor, driving modified changes in the binding sites, then it seems that this should affect the other transcription factors in the system. It could have been hard to figure anything out about these other transcription factors, but luckily they consist of SP3 and SP4, two paralogs of SP1. This means that they are ancient duplicates of ancestral SP proteins, they share a great deal of conserved sequence with SP1, and they bind with similar affinities to the SP1 binding consensus. And they have not just one or two, but between the two proteins, in birds and mammals, at least eight convergent amino acid replacements at the homologous position that putatively modified binding in SP1. And the substitution that occurs is the same replacement that occurred in bird SP1. Based on sequences from jawed fish and frogs, this position was almost completely conserved in the SP3 and SP4 paralogs for 360 or 450 million years of evolution. The convergent changes all occurred in only the last 100 million years or less of eutherian and bird lineages. We believe that the simplest interpretation is that, over tens of millions of years, a functional replacement occurred at the SP1 protein, adaptively driving hundreds of SP1 binding sites to convert from ancestral GA-boxes to derived GC-boxes, and that this then drove the same functional replacement in coregulatory paralogs SP3 and SP4.

Timing and a mechanism

A question often comes up at this point: "how do we know the order of these events?" The simplest piece of evidence for the order of events comes from the order of fixation of substitutions. The amino acid substitutions in SP1 are fixed in all eutherian mammals and all birds, indicating that they occurred on the branches leading to these taxon groups. The increase in GC-boxes occurred over time at different loci, mostly on the branch leading to eutherian mammals and on the branches immediately after that split the most ancient eutherian mammal groups. The replacements in SP3 and SP4 occurred later in the evolution of eutherian mammals and birds, and did not occur in all lineages. One might be able to come up with complicated scenarios whereby changes in some SP1 binding sites occurred first, driving the fixation of the SP1 replacement, followed by further selected changes in other SP1 binding sites, but we think our hypothesis is simpler.

Fig. 3. Population frequencies of an adaptive mutant transcription factor and its binding sites. (NOTE - SOME DETAIL OF LEGEND, INCLUDING THE PARAMETER VALUES, LOST IN COPY/PASTE - SEE PAPER). (A) Shown are the population frequencies of the adaptive mutant transcription factor allele (blue), which first occurs in a single heterozygous individual. The total population frequency of the novel binding consensus (BOXC) and the initial wild-type binding motif (BOXA) are shown in red and black, respectively. We assume a small adaptive benefit for the adaptive transcription factor SPC binding to BOXC over the wild-type transcription factor and its motif. Maladaptive binding events (SPC binding to BOXA or the wild-type transcription factor binding to BOXC) have reduced fitness. Population frequencies of SPC, BOXA, and BOXC are given on the left for the first 20,000 generations and on the right for 150,000 generations. (B) Evolution of the adaptive trans-factor and binding sites under a semi-dominant model. SPC binding to BOXC is assigned separate relative fitness values for individuals heterozygous for the transcription factor genotype and for individuals homozygous for the mutant transcription factor. (C) The single binding site locus model. In contrast to the previous model, each locus is restricted to no more than one binding motif (either BOXA or BOXC).

Other pieces of evidence also come into play. The question about the order of events can be rephrased as a question of whether neutral forces, such as changes in mutation rates at binding sites, could have altered the frequency of the alternative binding sites, with SP1 (and then SP3 and SP4) playing functional catch-up to better match the new binding site frequencies. It seems to us that such a model would predict that the binding sites would have changed irrespective of the function of the proteins that they regulate. (As an aside here, we note that our binding site data set is best described not as a definitive set of SP1-regulated promoters, but as a set that is highly enriched in functional SP1 binding sites. We don't trust binding site function predictions, and the putative binding sites inclusively considered were those that had either the ancestral GA-box or the derived GC-box in the functionally relevant region prior to the transcription start site. Such sites are highly enriched for categories of genes known to be under SP1 control.) But the binding sites that shift from GA-boxes to GC-boxes are even further enriched for categories of genes under SP1 control. This is not compatible with the neutral mutational shift model, but is compatible with the idea that the subset of our sites for which SP1 regulation is most important are the ones that were most likely to adaptively shift box type when the SP1 with altered function became more frequent.
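As a rough illustration of the inclusion criterion described in the aside above (a promoter counts as a putative site if it carries a GA-box or a GC-box upstream of the transcription start site), here is a minimal Python sketch that scans a sequence for the two motifs. The motif strings GGGCGG and GGGAGG come from the figure legends; scanning both strands, the function names, and leaving the choice of upstream window to the caller are assumptions made for this example, not details taken from the paper.

```python
GC_BOX = "GGGCGG"  # derived motif
GA_BOX = "GGGAGG"  # ancestral motif

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase or lowercase DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_boxes(promoter):
    """Return sorted (position, strand, box_type) hits in `promoter`.

    `promoter` is assumed to be the upstream region the caller cares
    about; both strands are scanned (an assumption for this sketch).
    """
    seq = promoter.upper()
    hits = []
    for box_type, motif in (("GC", GC_BOX), ("GA", GA_BOX)):
        for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
            start = seq.find(pattern)
            while start != -1:
                hits.append((start, strand, box_type))
                start = seq.find(pattern, start + 1)  # allow overlapping hits
    return sorted(hits)
```

For example, `find_boxes("TTGGGCGGAACCTCCCTT")` reports a GC-box on the forward strand at position 2 and a GA-box on the reverse strand at position 10.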

The mutational driver model also predicts a simple shift in frequencies driven by mutation. For example, GA-boxes might tend to mutate into GC-boxes, and conversely, GC-boxes might tend to be conserved and not mutate to GA-boxes. What I haven't told you yet, though, is that the excess GC-boxes do not tend to be produced by mutation from GA-boxes, but rather they tend to be produced as de novo mutations from non-SP1 box sequences. They are produced by a wide variety of mutations from a wide variety of different sequences that are slightly different from the canonical SP1 binding sites. Furthermore, the GC-boxes appear in a burst of birth early in eutherian evolution, but the GA-boxes don't disappear in a burst at the same time. Rather, they simply fade away slowly over time in lineages that have evolved GC-boxes. It is not clear to us that this can be explained using a mutation model, but it is easily explained by a model in which the SP1 replacement has adaptively driven hundreds of binding site convergent events. This is then followed by the slow mutational degradation of the GA-boxes, which don't matter so much to function anymore. It is also worth mentioning that the GC box preference doesn't seem to correlate with GC content, as several fish lineages are just as high in GC content as humans, but do not have the GC-box preference.
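The distinction drawn above, between GC-boxes that arose by conversion of a GA-box and GC-boxes that arose de novo from some other sequence, can be sketched as a simple classification of (ancestral, derived) sequence pairs at the same aligned position. This is only an illustrative sketch: the category labels and function names are hypothetical, and it assumes the ancestral six-base sequence at each site has already been reconstructed by some other means.

```python
from collections import Counter

GC_BOX = "GGGCGG"
GA_BOX = "GGGAGG"


def classify_gc_box_origin(ancestral, derived):
    """Label how a derived GC-box relates to the ancestral sequence."""
    ancestral, derived = ancestral.upper(), derived.upper()
    if derived != GC_BOX:
        return "not_gc_box"
    if ancestral == GC_BOX:
        return "retained"
    if ancestral == GA_BOX:
        return "converted_from_ga_box"
    return "de_novo"  # born from a non-SP1-box sequence


def tally_origins(pairs):
    """Count origin categories over (ancestral, derived) pairs."""
    return Counter(classify_gc_box_origin(a, d) for a, d in pairs)
```

A tally dominated by `de_novo` rather than `converted_from_ga_box` is the kind of pattern the paragraph above describes.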

Fig. 4. Structural changes of SP1 zinc finger 2 (zf2) following replacements at site -13. (Top) Comparisons of predicted lowest-energy zf2 structures between the native human peptide (-13M), and peptides following replacements to the ancestral valine (M-13V) and bird isoleucine (M-13I) at site -13. Structural alignments were conducted according to residues on the 5’ end of the peptide (residues -16 to -12). Both -13M and M-13I peptides showed displacement of residues 5’ to the DNA-contacting alpha-helix (sites -6 to -1) compared to the ancestral valine peptide. No such displacement was seen between -13M and M-13I. All three peptides aligned closely at the 3’ end of the alpha-helix (sites +6 to +10), reflecting structural modifications at the 5’ end of the alpha-helix. (Bottom) Distances between alpha carbons prior to and within the alpha-helix (blue and orange, respectively). Comparisons between the native human peptide and M-13V (left) and between M-13I and M-13V (center) show closely-aligned residues at the 3’ end of the alpha-helix and increasing displacement towards the 5’ end. These modifications begin around site +3, which directly contacts the A/C evolving site of the SP1 binding motif (Philipsen et al. 1999; Bouwman et al. 2002; Dhanasekaran et al. 2006). No such region-specific displacement was observed between -13M and M-13I (right).

The observed pattern of binding site evolution is also predicted by a model we developed to determine if the evolution of transcription factors and their binding sites could be explained in a population genetics framework. We asked, what is a possible mechanism by which these changes might occur? At the beginning of this post, I noted that a lot of people seem to think that evolution of complex multi-genic regulatory systems should be hard. We reasoned, though, that if the beneficial effects of a newly evolved binding interaction were dominant or semi-dominant (that is, the beneficial effects in the heterozygote were at least partly visible to selection), then it might be possible for evolution to be achieved through a transition period in which both the transcription factor and its cognate binding sites were polymorphic.

We developed both a deterministic and a stochastic model, and found that, indeed, even small (nearly neutral) selective benefits per locus can drive the entire system to fixation. What happens is that as long as there are some binding loci with the new binding box in the system, then a new variant of SP1 with a preference for the new binding box, and an associated small selective benefit, will at first rapidly increase in frequency. It won't immediately fix though, but rather will maintain a temporary steady state, kept down in frequency by the deleterious effects that new variant homozygotes would have with the large number of binding loci that are homozygous for the ancestral binding box. Once it has reached this steady state, it exerts selective pressure on all the binding loci to increase the frequencies of the new binding boxes at each locus. Much more slowly then, the frequency of the new transcription factor variant increases, in step with the frequencies of new binding boxes at all the binding loci.
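For readers who want to play with the idea, here is a toy deterministic sketch in Python. It is not the model from the paper, whose exact fitness scheme is not recoverable from the legend reproduced here; every name, parameter value, and fitness rule below is an assumption chosen for illustration. The toy assumes random mating, no linkage, additive effects across loci, and the same new-box frequency at every locus. Per locus, an individual with at least one new factor allele and at least one new box gets a small benefit (dominant); an individual that still has an ancestral factor allele and an ancestral box is at baseline; an individual with no matched factor/box pair at that locus pays a small penalty.

```python
def hardy_weinberg(freq):
    """Genotype frequencies for 0, 1 and 2 copies of the new allele."""
    return ((1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2)


def locus_effect(tf_copies, box_copies, benefit, penalty):
    """Toy fitness effect of one binding locus in one diploid individual."""
    if tf_copies >= 1 and box_copies >= 1:
        return benefit   # new factor can bind a new box (dominant benefit)
    if tf_copies <= 1 and box_copies <= 1:
        return 0.0       # ancestral factor still binds an ancestral box
    return -penalty      # no matched factor/box pair at this locus


def simulate(n_loci=1000, benefit=1e-4, penalty=1e-4,
             tf_start=1e-4, box_start=0.1, generations=150_000):
    """Return a list of (factor_freq, box_freq), one entry per generation.

    All default values are made up for illustration.
    """
    effects = [[locus_effect(g, b, benefit, penalty) for b in range(3)]
               for g in range(3)]
    p, q = tf_start, box_start
    trajectory = [(p, q)]
    for _ in range(generations):
        tf_geno, box_geno = hardy_weinberg(p), hardy_weinberg(q)
        # Average per-locus effect seen by each factor / box genotype.
        per_tf = [sum(box_geno[b] * effects[g][b] for b in range(3))
                  for g in range(3)]
        per_box = [sum(tf_geno[g] * effects[g][b] for g in range(3))
                   for b in range(3)]
        mean_effect = sum(tf_geno[g] * per_tf[g] for g in range(3))
        mean_fitness = 1 + n_loci * mean_effect
        # The factor feels all loci at once; a box feels only its own locus
        # plus the population-average background from the other loci.
        tf_fit = [1 + n_loci * e for e in per_tf]
        box_fit = [1 + (n_loci - 1) * mean_effect + e for e in per_box]
        p = (tf_geno[2] * tf_fit[2] + 0.5 * tf_geno[1] * tf_fit[1]) / mean_fitness
        q = (box_geno[2] * box_fit[2] + 0.5 * box_geno[1] * box_fit[1]) / mean_fitness
        p, q = min(1.0, max(0.0, p)), min(1.0, max(0.0, q))  # guard rounding
        trajectory.append((p, q))
    return trajectory
```

Calling `simulate()` and comparing, say, `trajectory[2000]` with `trajectory[-1]` should show the qualitative behavior described above under these made-up settings: the factor variant rises quickly to an intermediate frequency, because selection on it is summed over all loci, and then creeps upward together with the new boxes, each of which feels only its own small per-locus effect.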

Although our studies do not prove that our population genetics model is the exact mechanism for adaptive changes in SP1, it provides proof of concept that it is not difficult for such a mechanism to exist.

Where does this leave us?

At the broader level, this paper shows that small selective benefits can drive the evolution of complex regulatory systems (in diploids, at least; sorry to leave out the micro folks, Jon). Furthermore, it demonstrates, we believe convincingly, that adaptation has driven the evolution of the SP1 regulatory system, driving convergent evolution at many hundreds of promoters, and in SP3 and SP4. It thus strongly counters prevailing notions that such evolution is hard. We hope that this work (along with other work of this kind) will drive others to further pursue the broad questions in regulatory evolution. Are the details of the SP1 system common to other regulatory systems?

A particularly important question, which we did not focus on here, is whether the evolution we have described involves only static maintenance of the status quo in terms of which genes are regulated. One has to wonder, though, whether, if it is easy to evolve a static regulatory system, it is not also easier than previously believed to modify regulatory connections in a complex regulatory system. There are hints of such changes here, in that genes that may have gained novel SP1 regulation (that is, gained a GC-box when they did not have the ancestral GA-box) tend to be enriched in certain GO categories (see Table 2 in the paper).

For SP1, it will be interesting to see if good stories can be developed to explain why this adaptation should have occurred specifically in birds and eutherian mammals. The ideal story should include both a biophysical mechanism, and a physiology-based mechanism, such as the possibility that warm-bloodedness played a role. Both of these avenues promise to be complicated, if addressed properly. For example, we believe that it will be more meaningful if a biophysical mechanism can address the need for specificity as well as strength of binding, perhaps by utilizing next-generation sequencing approaches to measure affinities for all relevant binding site mutations (see, among others, our recent paper on this topic, Pollock et al., 2011). Are there interactive roles for selection on transcription factor concentration as well as efficiency and selectivity? What trade-offs exist among binding efficiency and binding site duplication? Do different types of regulatory connections evolve differently? These are all great questions for future research.

Addendum

I'll try to add further comments if questions or issues come up. I'm particularly interested to see how this non-press release guest blog post works as an experiment to promote the paper and the work. I also hope it will promote Ken Yokoyama's career (he's now at Illinois, and will probably be looking for an academic job in the next year or two). He did an awesomely diverse amount of work on this, learning how to work in totally new areas for him, such as population genetics, birth/death models, and protein structure prediction. He dove into these areas unhesitatingly to pursue the logical scientific questions, developed novel analyses, and did a great job. This paper represents a fundamental contribution and a fantastic advertisement for Ken's abilities.