This is a guest blog post about a paper of mine just published in Genome Biology and Evolution, "SP transcription factor paralogs and DNA binding sites coevolve and adaptively converge in mammals and birds", and it amounts to my first attempt at an un-press release. You can tell it's not a press release because I'm writing this about a week after the paper came out. This guest blog is coming about because I released an announcement about the paper to a few friends, and credited my old buddy Jon Eisen with having inspired me to move towards avoiding press releases. And I sure wouldn't want this paper to be named in the same breath as the now infamous arsenic bacteria and Encode project press releases (although I note that a recent paper from my lab estimates that the human genome is at least 2/3 derived from repetitive sequence); also see posts by Michael Eisen and by Larry Moran. I had the vague idea that maybe Jon, Ed Yong, Carl Zimmer, or some other famous blogger would be so inspired by the move (and the fundamental coolness of the paper) that they would quickly write up a thoughtful interpretive summary for general audiences. Jon, however, being much too smart for my own good, suggested that I should write it up as a guest blog instead. So, here I am. At least I got Jon to agree to edit it.
Okay then, why do I think adaptation and convergence in regulatory systems is cool and important? Well, first because I think a lot of important changes in evolution have to have come about through regulatory evolution, and yet there are huge gaps in our knowledge of how this change might happen. And I say this as someone who has spent most of his career studying (and still believing in the importance of) sequence evolution. Second, a lot of people seem to think that evolution of whole regulatory systems should be hard, because there are so many interactions that would need to change at the same time. Remember, transcription factors can interact with hundreds of different binding sites to regulate hundreds of different proteins. It makes sense that evolution of such a system should be hard. In this paper, I think we go a long way towards demonstrating that this intuitive sense is wrong, that functional evolution of regulatory systems can happen quite easily, that it has happened in a system with around a thousand functional binding sites, and that some of the details of how it happens are really interesting.
Aside from promoting the science, the other reason I want to blog about this paper is that I think it is a great demonstration of how fun and how diverse science can be. To support its points, it brings in many different types of evidence, from genomics to population genetics to protein structure prediction. It is also a good example of using only publicly available data, plus a lot of novel analysis, to see something interesting that was just sitting there. I think there must be a lot more stories like that. All of that Encode data, for example, is bound to have some interesting undiscovered stories, even with 30 papers already published. The most fun part, though, which I don't think I can fully recreate without jumping up and down in front of you, was just the thrill of discovery, and the thrill of having so many predictions fall into place with data from so many different sources.
I don't recall another project where we would say so many times, "well, if that is the explanation, then let's look at this other thing," and bam!, we look at the other thing and it fits in too. It started with Ken Yokoyama, the first author, walking into the lab having just published some pretty good evidence that the preference for the SP1-associated binding site (the GC-box) was newly evolved in the ancestor to eutherian mammals. Well, if that's true, there ought to be a change in the SP1 protein sequence that can explain it. Sure enough, there is, and SP1 is a very conserved protein that doesn't change a lot. Hmm, we have more sequences now, let's look to see if preferences changed anywhere else on the phylogenetic tree. Yes, in birds. Well, there ought to be a change in bird SP1 that can explain that; sure enough, there is, and it's at the homologous position in the protein. Looking good, but is it in the right place in the protein? Yes, in the right domain (zinc finger 2, or zf2), right behind the alpha helix that binds the nucleotide for which the preference changed. And before you ask, Ken ran a protein structure prediction algorithm on the amino acid replacements in SP1, and the predicted functional replacements in bird and mammal are predicted to bend the protein right at the point where it binds the nucleotide at which the preference changed. You might then ask if this amino acid replacement does anything to the binding function, and the answer again is "yes". This time, though, we were able to rely on existing functional studies, which showed that human SP1 binds 3x better to the GC-box binding site than it does to the ancestral GA-box (more on this below).
Timing and a mechanism
Two questions often come up at this point. The first is, "how do we know the order of these events?" The simplest piece of evidence for the order of events comes from the order of fixation of substitutions. The amino acid substitutions in SP1 are fixed in all eutherian mammals and all birds, indicating that they occurred on the branches leading to these taxon groups. The increase in GC-boxes occurred over time at different loci, mostly on the branch leading to eutherian mammals and on the branches immediately after that split the most ancient eutherian mammal groups. The replacements in SP3 and SP4 occurred later in the evolution of eutherian mammals and birds, and did not occur in all lineages. One might be able to come up with complicated scenarios whereby changes in some SP1 binding sites occurred first, driving the fixation of the SP1 replacement, followed by further selected changes in other SP1 binding sites, but we think our hypothesis is simpler.
Other pieces of evidence also come into play. The question about the order of events can be rephrased as a question of whether neutral forces, such as changes in mutation rates at binding sites, could have altered the frequency of the alternative binding sites, with SP1 (and then SP3 and SP4) playing functional catch-up to better match the new binding site frequencies. It seems to us that such a model would predict that the binding sites would have changed irrespective of the function of the proteins that they regulate. (As an aside here, we note that our binding site data set is best described not as a definitive set of SP1-regulated promoters, but as a set that is highly enriched in functional SP1 binding sites. We don't trust binding site function predictions, and the putative binding sites inclusively considered were those that had either the ancestral GA-box or the derived GC-box in the functionally relevant region prior to the transcription start site. Such sites are highly enriched for categories of genes known to be under SP1 control.) But the binding sites that shift from GA-boxes to GC-boxes are even further enriched for categories of genes under SP1 control. This is not compatible with the neutral mutational shift model, but is compatible with the idea that the subset of our sites for which SP1 regulation is most important are the ones that were most likely to adaptively shift box type when the SP1 with altered function became more frequent.
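For readers who like to see the logic of an enrichment argument spelled out, here is a minimal sketch of one common way to ask whether a subset (say, promoters that shifted from a GA-box to a GC-box) is over-represented in a gene category. It uses a hypergeometric upper-tail probability, which is an assumption on my part for illustration and not necessarily the test used in the paper. The function name and all of the counts are hypothetical placeholders, not data from the study.

```python
from math import comb  # requires Python 3.8 or later


def hypergeom_upper_tail(population, successes, draws, observed):
    """Probability of seeing at least `observed` category members
    when `draws` items are sampled without replacement from a
    population of size `population` that contains `successes`
    category members."""
    total = comb(population, draws)
    upper = min(draws, successes)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


# Hypothetical counts, made up purely for illustration:
# 1000 candidate promoters, 300 of them in SP1-controlled categories,
# 200 promoters that shifted box type, 90 of those in the categories.
p_value = hypergeom_upper_tail(1000, 300, 200, 90)
print(f"upper-tail probability: {p_value:.3g}")
```

Under a neutral mutational shift, the shifted promoters would look like a random draw from the candidate set, so a very small tail probability is the kind of result that argues against that model.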
The mutational driver model also predicts a simple shift in frequencies driven by mutation. For example, GA-boxes might tend to mutate into GC-boxes, and conversely, GC-boxes might tend to be conserved and not mutate to GA-boxes. What I haven't told you yet, though, is that the excess GC-boxes do not tend to be produced by mutation from GA-boxes, but rather they tend to be produced as de novo mutations from non-SP1 box sequences. They are produced by a wide variety of mutations from a wide variety of different sequences that are slightly different from the canonical SP1 binding sites. Furthermore, the GC-boxes appear in a burst of birth early in eutherian evolution, but the GA-boxes don't disappear in a burst at the same time. Rather, they simply fade away slowly over time in lineages that have evolved GC-boxes. It is not clear to us that this can be explained using a mutation model, but it is easily explained by a model in which the SP1 replacement has adaptively driven hundreds of binding site convergent events. This is then followed by the slow mutational degradation of the GA-boxes, which don't matter so much to function anymore. It is also worth mentioning that the GC-box preference doesn't seem to correlate with GC content, as several fish lineages are just as high in GC content as humans, but do not have the GC-box preference.
The observed pattern of binding site evolution is also predicted by a model we developed to determine if the evolution of transcription factors and their binding sites could be explained in a population genetics framework. We asked, what is a possible mechanism by which these changes might occur? At the beginning of this post, I noted that a lot of people seem to think that evolution of complex multi-genic regulatory systems should be hard. We reasoned, though, that if the beneficial effects of a newly evolved binding interaction were dominant or semi-dominant (that is, the beneficial effects in the heterozygote were at least partly visible to selection), then it might be possible for evolution to be achieved through a transition period in which both the transcription factor and its cognate binding sites were polymorphic.
We developed both a deterministic and a stochastic model, and found that, indeed, even small (nearly neutral) selective benefits per locus can drive the entire system to fixation. What happens is that as long as there are some binding loci with the new binding box in the system, then a new variant of SP1 with a preference for the new binding box, and an associated small selective benefit, will at first rapidly increase in frequency. It won't immediately fix though, but rather will maintain a temporary steady state, kept down in frequency by the deleterious effects that new variant homozygotes would have with the large number of binding loci that are homozygous for the ancestral binding box. Once it has reached this steady state, it exerts selective pressure on all the binding loci to increase the frequencies of the new binding boxes at each locus. Much more slowly then, the frequency of the new transcription factor variant increases, in step with the frequencies of new binding boxes at all the binding loci.
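To make that verbal description a bit more concrete, here is a toy deterministic sketch of this kind of dynamic. It is not the model from the paper: the fitness scheme, the function names, and every parameter value are assumptions I am making for illustration. It assumes a diploid, randomly mating population, linkage equilibrium, multiplicative fitness across binding loci, and the same new-box frequency at every locus. Carrying at least one new factor allele and at least one new box at a locus gives a small benefit s (a dominant effect), while new-factor homozygotes pay a cost d at each locus that is homozygous for the ancestral box.

```python
def genotype_freqs(x):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of the new allele."""
    return [(1 - x) ** 2, 2 * x * (1 - x), x ** 2]


def pair_fitness(tf_copies, box_copies, s, d):
    """Per-locus fitness for a factor genotype and a box genotype (toy scheme)."""
    if tf_copies > 0 and box_copies > 0:
        return 1 + s  # new factor meets new box: small dominant benefit
    if tf_copies == 2 and box_copies == 0:
        return 1 - d  # only new factor, only ancestral box: cost
    return 1.0


def step(p, q, n_loci, s, d):
    """One generation of selection on factor frequency p and box frequency q."""
    tf = genotype_freqs(p)
    box = genotype_freqs(q)
    # Mean per-locus fitness for each factor genotype.
    m = [sum(box[b] * pair_fitness(t, b, s, d) for b in range(3))
         for t in range(3)]
    # Marginal fitness of each factor genotype across all binding loci.
    w_tf = [m[t] ** n_loci for t in range(3)]
    w_bar = sum(tf[t] * w_tf[t] for t in range(3))
    p_next = (tf[2] * w_tf[2] + 0.5 * tf[1] * w_tf[1]) / w_bar
    # Marginal fitness of each box genotype at one focal locus,
    # with the other n_loci - 1 loci as genetic background.
    w_box = [sum(tf[t] * pair_fitness(t, b, s, d) * m[t] ** (n_loci - 1)
                 for t in range(3))
             for b in range(3)]
    q_next = (box[2] * w_box[2] + 0.5 * box[1] * w_box[1]) / w_bar
    return p_next, q_next


def simulate(n_loci=100, s=0.001, d=0.001, p0=0.001, q0=0.05,
             generations=2000):
    """Return a list of (p, q) pairs, one per generation, starting at (p0, q0)."""
    p, q = p0, q0
    history = [(p, q)]
    for _ in range(generations):
        p, q = step(p, q, n_loci, s, d)
        history.append((p, q))
    return history


if __name__ == "__main__":
    history = simulate()
    for generation in (0, 250, 500, 1000, 2000):
        p, q = history[generation]
        print(f"generation {generation}: factor {p:.4f}, boxes {q:.4f}")
```

With these made-up settings, the new factor variant should climb quickly and then sit at an intermediate frequency, while the new boxes creep upward much more slowly. Running it for many more generations is one way to watch the two frequencies rise together.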
Although our studies do not prove that our population genetics model is the exact mechanism for adaptive changes in SP1, it provides proof of concept that it is not difficult for such a mechanism to exist.
At the broader level, this paper shows that small selective benefits can drive the evolution of complex regulatory systems (in diploids, at least; sorry to leave out the micro folks, Jon). Furthermore, it demonstrates, we believe convincingly, that adaptation has driven the evolution of the SP1 regulatory system, driving convergent evolution at many hundreds of promoters, and in SP3 and SP4. It thus strongly counters prevailing notions that such evolution is hard. We hope that this work (along with other work of this kind) will drive others to further pursue the broad questions in regulatory evolution. Are the details of the SP1 system common to other regulatory systems?
A particularly important question, which we did not focus on here, is whether the evolution we have described involves only static maintenance of the status quo in terms of which genes are regulated. One has to wonder, though, whether, if it is easy to evolve a static regulatory system, it is not also easier than previously believed to modify regulatory connections in a complex regulatory system. There are hints of such changes here, in that genes that may have gained novel SP1 regulation (that is, gained a GC-box when they did not have the ancestral GA-box) tend to be enriched in certain GO categories (see Table 2 in the paper).
For SP1, it will be interesting to see if good stories can be developed to explain why this adaptation should have occurred specifically in birds and eutherian mammals. The ideal story should include both a biophysical mechanism, and a physiology-based mechanism, such as the possibility that warm-bloodedness played a role. Both of these avenues promise to be complicated, if addressed properly. For example, we believe that it will be more meaningful if a biophysical mechanism can address the need for specificity as well as strength of binding, perhaps by utilizing next-generation sequencing approaches to measure affinities for all relevant binding site mutations (see, among others, our recent paper on this topic, Pollock et al., 2011). Are there interactive roles for selection on transcription factor concentration as well as efficiency and selectivity? What trade-offs exist among binding efficiency and binding site duplication? Do different types of regulatory connections evolve differently? These are all great questions for future research.
I'll try to add further comments if questions or issues come up. I'm particularly interested to see how this non-press release guest blog post works as an experiment to promote the paper and the work. I also hope it will promote Ken Yokoyama's career (he's now at Illinois, and will probably be looking for an academic job in the next year or two). He did an awesomely diverse amount of work on this, learning how to work in totally new areas for him, such as population genetics, birth/death models, and protein structure prediction. He dove into these areas unhesitatingly to pursue the logical scientific questions, developed novel analyses, and did a great job. This paper represents a fundamental contribution and a fantastic advertisement for Ken's abilities.