Using LANL GFP to find Soluble Domains in Proteins
The development of a fast, reliable system for identifying the individual soluble domains in proteins was motivated by the following facts:
LANL's solution is to use our highly evolved version of GFP to approach this problem differently. Waldo has designed a system based on a library of fragmented genes that are then expressed (Fig. 1, #1 and #2). Using a version of our GFP as a solubility reporter, the proteins expressed from the library can then be rapidly screened to find those that are soluble (Fig. 1, #3).
Our process involves four steps and takes approximately seven business days to go from gene sequence(s) to identified soluble domains. This turnaround is fast considering the time and effort other processes require to obtain the same information. The four steps are described in detail below.
Step 1 (Library size ≥10^7)
The first step of the LANL process is the fragmentation of a gene or genes of interest. Keep in mind that, with the assistance of robotics, this process could be scaled to full genome analysis.
A set of genes or ORFs is fragmented using any of a number of methods, such as enzymatic digestion (e.g. DNase) or mechanical disruption with shearing forces (Fig. 2, #2). If the size of the gene or set of genes is known, the fragments are run on an agarose gel, and fragments of the desired size are cut out and carried into the next step of the process for focused study.
Fragments are then blunt cloned and screened. However, since >95% of the cloned fragments are not in the native reading frame and are likely to contain a stop codon (Fig. 3, #1), the DNA fragments are first cloned into the LANL ORF selector, before valuable time and effort is spent on the GFP screen. In the LANL ORF selector, the test fragment is inserted between the two halves of the dihydrofolate reductase (DHFR) enzyme to determine which inserts are out of frame (Fig. 3).
If the inserted fragment is out of frame, it is likely to have one or more stop codons, so the second half of the DHFR selection gene will not be expressed. As a result, when the host E. coli cells are plated on a low concentration of trimethoprim, the clones with out-of-frame constructs cannot survive. Clones without stop codons (the fragment is in frame) produce functional DHFR, which confers resistance to trimethoprim, and so survive (Fig. 4, image 2). These plasmids are then recovered and investigated further (Fig. 4, image 3, the plasmids with green inserts).
Keep in mind that solubility has not yet been determined at this step. This step simply removes those DNA fragments that do not encode authentic protein domains.
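The read-through logic behind the ORF selector can be illustrated with a short Python sketch. This is only an illustration of the idea, not LANL's software: the function name passes_orf_filter is hypothetical, and the sketch assumes the vector junctions are arranged so that an insert whose length is a multiple of three keeps the downstream DHFR half in frame.

```python
# Minimal sketch of the ORF-selector idea (assumptions noted above).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_orf_filter(insert_dna: str) -> bool:
    """Return True if the downstream DHFR half would still be translated.

    Two conditions are checked, reading from the first base of the insert:
    1. the insert length is a multiple of three (frame is preserved), and
    2. no stop codon occurs in that reading frame.
    """
    seq = insert_dna.upper()
    if len(seq) % 3 != 0:
        return False  # frameshift: downstream half is read out of frame
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    return not any(codon in STOP_CODONS for codon in codons)


if __name__ == "__main__":
    for insert in ("ATGGCCAAA", "ATGTAAGCC", "ATGGC"):
        print(insert, passes_orf_filter(insert))
```

Note that such a check only tests for read-through. An insert can pass it and still be translated in a frame other than the native frame of the parent gene, which is one reason sequencing and mapping in the final step remain necessary.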
Step 3 (Library size approx. 10^5)
Once the collection of fragments known to express viable protein domains has been identified, the in-frame inserts can be cloned into LANL's in vivo Split GFP system to determine which in-frame, expressed domains are soluble (Fig. 4, #1). The Split GFP constructs are transformed into and grown in E. coli, and single colonies are screened by plating on agar plates. First, expression of the S11-tagged protein is induced, followed by induction of the GFP detector fragment (S1-10) (Fig. 5, left).
Under these conditions, brighter clones indicate that the S11-tagged, soluble protein is interacting with the S1-10 detector fragment and that fluorescence is being produced. The brightest clones are then picked from the agar plates and grown in 96-well liquid cultures. By inducing only the AnTET promoter in the 96-well plate cultures, only the S11-tagged protein is expressed. The cells are lysed, and the soluble and insoluble protein products are quantified by adding (not inducing the expression of) the in vitro S1-10 detector fragment.
Briefly, the Split GFP system is based on fragmenting a highly modified GFP into two separate, soluble pieces: GFP strand 11 (S11), also known as the "tether" or "tag", and GFP strands 1-10 (S1-10), known as the "detector fragment" (Fig. 6).
First, the S11-tagged protein domain is expressed using AnTET induction. Then the S1-10 detector fragment is expressed using IPTG induction. If the expressed domain is soluble, the S11 "tether" is free to interact with the larger GFP fragment, S1-10. The interaction of S11 and S1-10 allows GFP to fluoresce (Fig. 6).
The beauty of the LANL GFP system is that there is no need to reclone the construct to quantify expression. To express the S11-tagged protein alone, one simply adds only AnTET, which induces expression of the S11-tagged protein, and the S1-10 can be added as an in vitro reagent!
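One simple way to turn the in vitro readings from the 96-well plate into a per-clone number is sketched below in Python. This is an assumption-laden illustration rather than LANL's protocol: the function name soluble_fraction and the background parameter are hypothetical, and the sketch assumes that the soluble and pellet fluorescence readings are on the same scale and proportional to the amount of S11-tagged protein in each fraction.

```python
# Hypothetical helper: fraction of S11-tagged protein found in the soluble
# fraction, estimated from split GFP fluorescence after adding S1-10 in vitro.
def soluble_fraction(soluble_signal: float,
                     pellet_signal: float,
                     background: float = 0.0) -> float:
    """Return soluble / (soluble + pellet) after background subtraction."""
    # Readings at or below background are treated as zero signal.
    sol = max(soluble_signal - background, 0.0)
    pel = max(pellet_signal - background, 0.0)
    total = sol + pel
    if total == 0.0:
        return 0.0  # no detectable expression in either fraction
    return sol / total


if __name__ == "__main__":
    # Illustrative numbers only, not measured data.
    print(soluble_fraction(950.0, 150.0, background=50.0))
```

A score of this kind, scaled between 0 and 1, is convenient for ranking clones from the same plate, but the choice of normalization is a design decision and other schemes would work equally well.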
Step 4 (Library size 10^2)
In the final step of LANL's process, clones of interest are selected from the 96-well plate used in Step 3 and sequenced (Fig. 7, #1). At LANL, the entire 96-well plate is typically sequenced so that the positions of all the fragment clones can be mapped to the parent gene.
In silico, the sequences of the gene fragments are aligned onto the full parent gene and color coded by solubility. At LANL, we use the color scheme red-orange-yellow-green-blue-bluish white, where the red end of the spectrum identifies the least soluble domains. This makes the experimental domain boundaries very clear and makes it easy to identify compact members of the groups (a.k.a. the minimal tiling path), which are subcloned for scale-up (Fig. 7, #2). For crystallographic applications, the clone is subcloned without the S11 tag, or the tag can be included as a way to track purification and then cleaved off.
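The mapping and color coding described above can be sketched in Python as follows. This is a simplified illustration, not the tool used at LANL: the names Fragment, locate, color_for and map_fragments are hypothetical, fragments are placed by exact substring matching rather than true sequence alignment, and the solubility score is assumed to be scaled from 0.0 (insoluble) to 1.0 (fully soluble) and split into six equal bins.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Color scale from the text, ordered from least to most soluble.
COLORS = ["red", "orange", "yellow", "green", "blue", "bluish white"]


@dataclass(frozen=True)
class Fragment:
    name: str
    protein_seq: str   # translated sequence of the cloned fragment
    solubility: float  # assumed score between 0.0 and 1.0


def locate(parent: str, frag: Fragment) -> Optional[Tuple[int, int]]:
    """Return the 1-based, inclusive (start, end) of the fragment in the parent."""
    idx = parent.find(frag.protein_seq)
    if idx < 0:
        return None  # fragment does not match the parent exactly
    return idx + 1, idx + len(frag.protein_seq)


def color_for(solubility: float) -> str:
    """Bin a 0-1 solubility score into one of the six colors."""
    clipped = min(max(solubility, 0.0), 1.0)
    return COLORS[min(int(clipped * len(COLORS)), len(COLORS) - 1)]


def map_fragments(parent: str,
                  fragments: List[Fragment]) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, name, color) rows sorted by position on the parent."""
    rows = []
    for frag in fragments:
        span = locate(parent, frag)
        if span is not None:
            rows.append((span[0], span[1], frag.name, color_for(frag.solubility)))
    return sorted(rows)


if __name__ == "__main__":
    # Toy parent sequence and fragments, for illustration only.
    parent = "MKTAYIAKQRQISFVKSHFSRQ"
    frags = [Fragment("f2", "QISFVKSH", 0.9), Fragment("f1", "AYIAKQ", 0.1)]
    for row in map_fragments(parent, frags):
        print(row)
```

Sorting the rows by start position gives a simple text version of the map, from which runs of soluble (blue to bluish white) fragments and their boundaries can be read off.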
Crystallizing a 2200 aa protein from Mycobacterium tuberculosis
If the intent is to crystallize the identified soluble domains, as we did with a 2200 aa protein called PpsC, LANL recommends selecting multiple compact versions of each soluble domain to increase the probability of finding crystallizable, diffracting constructs (typical yields are 15-30 mg/l, concentrated to >40 mg/ml). Figure 8 shows an SDS gel of some of the most compact (smallest in set) soluble fragments, or soluble domains, representing the 6 predicted domains of PpsC, subcloned into a pET N-terminal 6HIS vector. Lanes show the soluble (S) and insoluble pellet (P) fractions of E. coli lysates.
One of the advantages of having a dense sampling of fragment position and solubility is that it is easy to identify the boundaries of a domain. As fragments get progressively shorter, they become soluble near the domain boundaries, then suddenly become less soluble as they are truncated further. This makes it easier to select a small subset of compact clones for detailed study. Shown in Fig. 9 are some compact ‘double domain’ constructs containing the KR+ACP domains of the large polyketide synthase from Mycobacterium tuberculosis (M. tb). The protein is nicely monodisperse by gel filtration chromatography and runs as a monomer without significant aggregation at >10 mg/ml.
Compact clones are less likely to contain disordered ends, and having several choices near a given boundary size increases the chance that at least one will crystallize and give diffraction-quality crystals. The split GFP domain trapping protocol readily identifies two sets of fragments: one focusing on a larger version of the ER-containing domain, spanning ca. amino acids 1480-1755, and another, more compact version focusing down to amino acids 1558-1750. This fragment crystallized and is very similar to the previously published construct Shapiro et al. used to solve the structure (1PQW) (Fig. 10).
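The idea of picking several compact, soluble clones around a domain can be sketched in Python as shown below. This is an illustrative sketch under stated assumptions, not the selection procedure used at LANL: the names Clone and compact_candidates, the solubility threshold and the core region coordinates are all hypothetical, and the solubility scores in the example are invented for illustration (only the two residue ranges 1480-1755 and 1558-1750 come from the text).

```python
from typing import List, NamedTuple


class Clone(NamedTuple):
    name: str
    start: int         # first residue on the parent protein (1-based)
    end: int           # last residue on the parent protein (inclusive)
    solubility: float  # assumed score between 0.0 and 1.0


def compact_candidates(clones: List[Clone],
                       core_start: int,
                       core_end: int,
                       min_solubility: float = 0.5,
                       n: int = 3) -> List[Clone]:
    """Return up to n shortest soluble clones that fully cover a core region."""
    covering = [
        c for c in clones
        if c.start <= core_start
        and c.end >= core_end
        and c.solubility >= min_solubility
    ]
    # Shortest first; ties are broken in favor of the more soluble clone.
    return sorted(covering, key=lambda c: (c.end - c.start, -c.solubility))[:n]


if __name__ == "__main__":
    clones = [
        Clone("A", 1480, 1755, 0.8),  # larger version (range from the text)
        Clone("B", 1558, 1750, 0.7),  # compact version (range from the text)
        Clone("C", 1600, 1750, 0.9),  # hypothetical: truncates the core
        Clone("D", 1550, 1760, 0.2),  # hypothetical: covers core, poorly soluble
    ]
    # Hypothetical core region chosen for this example.
    for clone in compact_candidates(clones, 1560, 1750, n=2):
        print(clone.name, clone.start, clone.end)
```

Returning several candidates rather than a single best clone mirrors the recommendation above to carry multiple compact versions of each domain into crystallization trials.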