Sequence synthesis problem


Originally posted to the DIYbio Google group


I am currently looking for another commercial gene synthesis company that can do this, and other similar projects involving promoters.

Just posting this because it was a surprise that the company could not synthesize this sequence. The sequence in question is below. It is a promoter, the 1 kb upstream of the histone H1 gene of the ascomycete fungus Neurospora crassa, which I wanted to modify in vivo after re-introducing it into N. crassa upstream of a GFP reporter. I sent it to Operon and they could not synthesize it because of some repetitive/simple sequence regions (the regions they noted are also pasted below). They tried for a month. This surprised me because I would think that promoters are one DNA element a lot of people would want synthesized, so that they can hook them up to a gene of interest. A lot of people working with simple microbes (E. coli, yeast, fungi) are going to be doing this, since these organisms are easier to manipulate genetically and have relatively compact promoters, and all promoters, by their nature, are going to have some structure. I will likely send it to somebody else to try (Epoch has been mentioned in this group) to see if they can do it. Anybody with experience at getting promoters synthesized? It did not really occur to me that this would be an issue; maybe the company just sucks at this. Anybody doing larger synthetic projects has to have run into this issue and overcome it. Part of the problem may be that they thought I was getting a gene synthesized. I didn't think to mention it was a promoter, but that should not matter as I was not asking them to optimize it in any way.


Sequence:
>UPSTREAM1000_histone H1
TGAAGAAGAGAGATAAATAGATAAATAGAAGATGATGGGATGGGCGAGATACCGACTGAA
TCTGAGAGATGGGATGCGGGATGGATGGATTGATGCCCTGCGGCTTCGATTGTGCCAACC
CAGCCAGCCAGCCAGCCATGCCGACCGACCGACCGATATGCGACATGCCATGCCACCTCA
CACACACAGGCACACTGATAACCTGCGAGTCGACGAGGAGAGGTGACGGGCGGCAGAACA
TGCTTCTTTTGGGGCCAGTGAATGATGCCGTGTCCCACCTTGGATCATCCAATCTGTCCG
GACCAGACTCCATCTGGAATGGACATCCATCGGCATCCGCACTCCCCTGGACCCCAATCG
GTTTCTAAAAAGAGGAGAAACGGAAGAGGAAAAGGGAAAGGGAAAAAAAAAAAGAAGAAC
AAGTGGGATCGATGGGACATGGGACAGCACCACTGCATCTCCAGCCGAGTCCATGGAAAC
GGGAAGACAAGGGGAGGGGGGGGGAGAGGGAGAGGGAGGAGGGGGAAGGGGAAGGAGAAG
GGGATGGGGATGGCGACCGAGAGGATAGGTACCTACTGTAGGGACGGGAAATCTCATCGA
CAACCACACAACGAAGCATCGATGCTCTCGAGGTCTCTTCCCCTTCCTTCATGAGACAAG
CGAAAAGGAAAAGGTCCGGAGCCCCAGCTTCCACATCGTGTTGACATGGAACGAGGGAAC
AGGAATCGGGGCCACTGGCCGGCTTCTTTCGTTCTTTCAGCGTGTGTTAGTGGGGTGCAC
GGGCCACATATCCCCGGGAAATGGGCTGGGGGTAGCGGCTTCCAGGAGGTCACAGAGGCC
CCCCCCCCCAGGTCGCAGGGGGAGACGGGAGGTCCGTCGGGGCAGGGGCAGGGAAGAATC
AGCGAAATCACTCGGTCGCGCCAGGAGACCCCGCCTCCGTATATAAACACCCAATCTTCC
CCCCTCGAGCGCGACTGAGCCCACCCATCCTCCTCTCGTC


Dear Thomas  Randall,
 

Your gene sequence exhibited the following properties:
 

1.     A codon adaptation index (CAI) of 0.75 (typically should be between 1.0–0.8)
2.     GC content of approximately 58.61% (typically should be between 30–70%)
3.     Number of CpG = 59
4.     Percentage of low frequency codons based on an E.coli host organism is 21% (this is decent)
5.     Direct repeats = 8
6.     Negative cis-acting elements = 1
 

All genes submitted to Eurofins MWG Operon for sequencing usually undergo optimization prior to synthesis, so I think your problem is more inherent in the design of your gene. I have asked for a more in depth explanation from the lab on your order failure. Once I get that information I will be better equipped to provide alternative options for your design.
 

Pertaining to good bioinformatics gene optimization and design tools, here are a few links:
 

https://www.dna20.com/genedesigner2/    (DNA 2.0 Gene Design Software Beta version – requires an account)
http://www.geneinfinity.org/sp/sp_motif.html#patterns  (Repository of free online servers for performing various sequence screens and comparisons)
http://www.bitgene.com/index.shtml (open source basic gene analysis and synthesis tool)
http://www.geneius.de/GENEius/Security_login.action (Eurofins MWG Operon gene optimization free online tool)

A few days later:


 Dear Thomas Randall,
 

Our lab technicians were able to pull up more information pertaining to your gene (please see below). From the query, the gene sequence didn’t seem to have that many regions of complexity to cause the failure observed, so it’s difficult to pinpoint exactly how to improve upon the sequence. It is possible the presence of folding motifs or the GC rich portion interfered with the correct assembly of the gene (we were unable to obtain a clone with the correct gene sequence). Also, there were two ORFs for this gene sequence, so that could have factored in to the ambiguity of the clones.
 

I hope this helps.
 

Result

Sequence Length: 1014 bps
GC-Content: 57.99%
 

Direct Repeats
 

Direct Repeat 1
1. Position: 128
2. Position: 132
Length: 15
Mismatches: 0
 

ccagccagccagcca
ccagccagccagcca
 

Direct Repeat 2
1. Position: 149
2. Position: 153
Length: 12
Mismatches: 0
 

ccgaccgaccga
ccgaccgaccga
 

Inverted Repeats
 

Inverted Repeat 1
1. Position: 623
2. Position: 623
Length: 12
Mismatches: 0
 

agcatcgatgct
agcatcgatgct
 

GC-Rich Subsequences
 

Sequence 1
Position: 800
GC-Content: 75%
 

ccccgggaaatgggctgggggtagcggcttccaggaggtcacagaggcccccccccccaggtcgcagggggagacgggaggtccgtcggggcaggggcag
 

Homopolymers
 

Homopolymer 1
Position: 411
Length: 11
 

aaaaaaaaaaa
 

Homopolymer 2
Position: 504
Length: 9
 

ggggggggg
 

Homopolymer 3
Position: 847
Length: 11
 

ccccccccccc 
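
The kinds of features flagged in the report above (GC content, homopolymer runs, short tandem repeats) can be screened for before a sequence is submitted for synthesis. The following is a minimal sketch, not the vendor's method: the function names and the thresholds (runs of 9 or more, 4 bp units repeated 3 or more times) are assumptions chosen for illustration, and positions are reported 1-based from the start of whatever sequence is passed in.

```python
import re

def gc_content(seq):
    """Percent G+C in a DNA sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def homopolymers(seq, min_len=9):
    """Return (1-based start, base, run length) for single-base runs >= min_len."""
    pattern = r"([ACGT])\1{%d,}" % (min_len - 1)
    return [(m.start() + 1, m.group(1), len(m.group()))
            for m in re.finditer(pattern, seq.upper())]

def tandem_repeats(seq, unit_len=4, min_copies=3):
    """Return (1-based start, unit, copies) for back-to-back repeats of a unit."""
    pattern = r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1)
    return [(m.start() + 1, m.group(1), len(m.group()) // unit_len)
            for m in re.finditer(pattern, seq.upper())]

# Small made-up test sequences, not the promoter itself
print(gc_content("GGCCAATT"))                    # 50.0
print(homopolymers("ACGT" + "A" * 11 + "G"))     # [(5, 'A', 11)]
print(tandem_repeats("TTCCAGCCAGCCAGCCATG"))     # [(3, 'CCAG', 3)]
```

Running these on a FASTA sequence with the header line and line breaks stripped gives a quick list of regions a synthesis company might flag.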

NGS Illumina Sequencing


Current project: map the sequence variation(s) underlying the col-4 mutation in N. crassa strain TR1 (a his-2; mtr col-4), a single-ascospore progeny of FGSC 3017 (a his-2; mtr col-4) and FGSC 2489 (OR74 A). col-4 is genetically defined as being 1 map unit (around 40 kb in this region) to the right of mtr on LG IV (chr4).

DNA sample (~ 50 ug) sent to Operon for 1 lane of 100 bp paired end sequencing on a HiSeq machine.

QC of DNA sample.


Flash Gel from Operon of the original undigested DNA sample. There is some degradation, but since the DNA sample will be sheared this did not matter.

QC of DNA sample after size selection of library.

The region around 500 bp was used for library preparation.

Data arrived on an external hard drive: two ~36 GB fastq files, each containing ~145 million 100 bp reads.


My QC of the sequence data was done with FastQC (free from http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) installed on my MacBook Air.

Paired end 1

Paired end 2


Really good quality in both cases. Interpretation of this figure can be found at the Babraham website (the y axis is quality score; the higher the better).

Mapping and Analysis of data (so far).

Initial manipulation of the data and the FastQC analysis were done on a MacBook Air, one file at a time (with a 250 GB hard drive, this laptop is not really appropriate for such large datasets). All the computationally intensive analyses were done on the publicly available Galaxy bioinformatics website (https://main.g2.bx.psu.edu/), but could have been done on the command line with the freely available bioinformatics tools mentioned below. Essentially, the reads were mapped to the chr 4 reference sequence of N. crassa from the Broad Institute (http://www.broadinstitute.org/annotation/genome/neurospora/MultiHome.html) with bwa (http://bio-bwa.sourceforge.net/), followed by SNP/indel analysis with the pileup function of samtools (http://samtools.sourceforge.net/).


The original data represents ~600 X coverage of the 40 Mb N. crassa genome. Approx 100 X coverage was taken from this set as sufficient for identification of SNPs and indels (just the first 20 M reads from each paired end file; reads are in the same order in each). The smaller dataset is also more manageable with the above mentioned analysis tools. All analyses below were confirmed with a different set of 20 M reads.
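
Taking the first 20 M reads from each file, and the ~100 X figure that results, can be sketched as follows. This is an illustration under assumptions, not the exact procedure used: it assumes standard four-line FASTQ records, and the function names are made up for the example.

```python
from itertools import islice

LINES_PER_RECORD = 4  # FASTQ record: header, sequence, '+', qualities

def first_n_reads(lines, n_reads):
    """Return an iterator over the lines of the first n_reads FASTQ records."""
    return islice(lines, LINES_PER_RECORD * n_reads)

def subsample_fastq(src_path, dst_path, n_reads):
    """Copy the first n_reads records of one FASTQ file to a new file."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(first_n_reads(src, n_reads))

def fold_coverage(n_reads, read_len_bp, genome_size_bp):
    """Nominal coverage: total sequenced bases divided by genome size."""
    return n_reads * read_len_bp / genome_size_bp

# 20 M reads from each of the two paired-end files, 100 bp each, 40 Mb genome
print(fold_coverage(2 * 20_000_000, 100, 40_000_000))  # 100.0
```

Because the reads are in the same order in both files, taking the same number of records from the top of each keeps the pairs matched.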

Existing genetic mapping data, and the location of the col-4 mutation relative to known genes, helped guide where to look for SNPs/indels: approximately between coordinates 1,827,271 (mtr) and ~1,900,000 (a breakpoint between the reference sequence, in which the col-4 gene is known to reside, and adjacent sequence from a different genetic background, easily discernible from a jump in SNPs/kb). Two candidate SNPs and one indel were identified, all within a 4 kb region in a contiguous stretch of ~70 kb of reference sequence. No other variation from the reference sequence was found. A sample of the raw pileup output (which goes on for ~6 M rows or so, covering the entire chr 4) is below:

SNP pileup

col 1 - line count from text editor
col 2 -name of the chr4 ref sequence
col 3 - bp coordinates of the ref sequence on chr4
col 4 - reference base call
col 5 - sequence coverage at that base
the rest - individual base calls for each read, with "." meaning the reference base. The meaning of some of the other characters can be found in the samtools documentation; others, like the alternative base calls, are obvious.

The main point is that a SNP (and N. crassa is haploid) really sticks out. Clearly at position 1,842,866 there is an A to G change in my sequenced strain.
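
For anyone working from the command line instead of Galaxy, tallying the base calls in one pileup row can be sketched as below. This is a minimal illustration under assumptions: it expects the raw tab-separated samtools pileup columns (reference name, position, reference base, depth, read bases, qualities) without the text-editor line count listed as col 1 above, and the reference name "chr4", the example row and the function names are made up.

```python
import re
from collections import Counter

def count_pileup_bases(bases, ref):
    """Tally base calls in the read-bases column of one pileup row."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":                 # start of a read; next char is its mapping quality
            i += 2
            continue
        if c in "+-":                # indel: sign, length, then that many bases
            n = re.match(r"\d+", bases[i + 1:]).group()
            i += 1 + len(n) + int(n)
            continue
        if c in ".,":                # match to reference (forward / reverse strand)
            counts[ref.upper()] += 1
        elif c.upper() in "ACGTN":   # mismatch (upper = forward, lower = reverse)
            counts[c.upper()] += 1
        i += 1                       # '$' (end of read) and other symbols are skipped
    return counts

def major_allele(line):
    """Return (position, reference base, most common call, its fraction of calls)."""
    name, pos, ref, depth, bases = line.rstrip("\n").split("\t")[:5]
    counts = count_pileup_bases(bases, ref)
    base, n = counts.most_common(1)[0]
    return int(pos), ref.upper(), base, n / sum(counts.values())

# Made-up row: 4 reads call G, 2 reads match the reference A
row = "chr4\t1842866\tA\t6\t^]Gg.G$G+2AT,\tIIIIII"
print(major_allele(row))
```

In a haploid genome a real SNP should show the alternative base in nearly all reads, so filtering rows where the major allele differs from the reference and its fraction is close to 1 is a reasonable first pass.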

All three of the variations identified are intergenic. The one above, which is ~100 bp upstream of a candidate gene, is what I will focus on first. Actual molecular genetics experiments still need to be done to test for the causative variation.

In addition to the above novel variations, the strain sequenced had additional markers that could be confirmed by this sequencing. The strain is mating type a (N. crassa is either A or a), his-2 (histidine auxotroph), and mtr (deficient in the neutral amino acid permease). Mutations in both of these genes were found. For mtr, a C>G change causing a Tyr290 > TAG (stop codon) at ref coordinate 1,826,174 and for his-2 (on chr1, a separate alignment) a G>C at ref coordinate 4,256,930 causing a nonsynonymous Ala > Pro (A32P). Good to have confirmation of known differences.



Sanger Sequencing (old stuff)


Preparation of DNA templates for sequencing


DNA fragments
Run preparative gel and isolate fragment from gel and purify with Qiagen (or similar) column. Definitely follow the Qiagen suggestions for purifying fragments to be sequenced.
Elute with ddH2O only (keeping fragment away from EDTA is important for subsequent sequencing so no EDTA containing buffers)
Ethanol precipitate the DNA if it is not free of buffer impurities, then resuspend in ddH2O.
Check concentration on a gel containing a dilution series of a DNA marker of known concentration to roughly quantitate DNA concentration.

Final prep for sequencing reaction:
Resuspend fragment (10 ng / 100 bp fragment, so 100 ng for a 1 kb fragment) in a final vol of 20 ul for the UNC sequencing facility, along with 10 pmoles of the appropriate primer, in a 0.5 ml PCR tube and label appropriately. The concentrations suggested for both DNA and primers are specific to the DNA sequencing facility being used.
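
The 10 ng per 100 bp rule amounts to a constant molar quantity of template whatever the fragment length. The sketch below works through the arithmetic, assuming an average mass of about 660 g/mol per base pair of double-stranded DNA (a common approximation, not something specified by the facility); the function names are made up for illustration.

```python
AVG_BP_MASS = 660.0  # approximate g/mol per base pair of dsDNA (assumption)

def template_ng(fragment_bp, ng_per_100bp=10.0):
    """Mass of fragment to submit under the 10 ng / 100 bp rule."""
    return ng_per_100bp * fragment_bp / 100.0

def ng_to_pmol(mass_ng, length_bp):
    """Convert a mass of dsDNA (ng) of a given length to picomoles."""
    return mass_ng * 1000.0 / (length_bp * AVG_BP_MASS)

for bp in (500, 1000, 2000):
    ng = template_ng(bp)
    print(bp, "bp:", ng, "ng =", round(ng_to_pmol(ng, bp), 3), "pmol")
```

Each fragment size comes out at roughly 0.15 pmol of template, so the 10 pmol of primer is in large molar excess in every case.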


Plasmid DNA
Do 5 ml miniprep in TB plus amp overnight, 37C.
Purify DNA with Qiagen (or similar) column and elute in ddH2O.
PEG purification of plasmid DNA (best for quality sequencing results).
Add 1 vol of 1.6 M NaCl/13% PEG 8000 to DNA.
Incubate on ice for 30 min.
Spin down at 13K for 15 minutes.
Rinse in 70% EtOH and air dry.
Resuspend in 20-50 ul ddH2O.
Quantitation: Do an EcoRI digest of the DNA and run it alongside a dilution series of a DNA marker of known concentration to roughly quantitate DNA concentration.

Final prep for sequencing reaction:
Put 700 ng plasmid DNA and 10 pmol of primer in a final volume of 20 ul ddH2O for the UNC sequencing facility in a 0.5 ml PCR tube.

An example sequence.
Example chromatograms from a plasmid DNA template. pGREEN (Carolina Biological) was used as DNA template and miniprepped as above in order to sequence the GFP gene insert. This was done to determine whether the insert corresponded to the GFP ORF in a plasmid called pGREEN described in Genbank accession AB124780, or to the version of GFP from US Patent 6027881, sequence ID 16 (SG11), which Carolina Biological pointed to in reply to an email inquiry. Carolina Biological apparently does not offer (or have) the full sequence of their pGREEN plasmid on their website; I was referred to this patent when requesting the sequence by email.

Chromatograms - these are raw, unedited chromatogram data. Sequences 1 & 3 used the M13F primer (GTAAAACGACGGCCAG) and 2 & 4 used the M13R primer (CAGGAAACAGCTATGAC); the template was the pGREEN plasmid purified as described above. These chromatograms were manually edited and assembled using the default assembly parameters of Sequencher. A complete ORF of the GFP gene is found on the reverse strand.
1
2
3
4

The fasta files for the DNA seq and ORF from pGREEN after assembly of the chromatograms above in Sequencher are below.
>pGreen_consensus_seq
CACTACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCGACGCGTGGCTCCTCAGTTGTACAGTTCA
TCCATGCCATGTGTAATCCCAGCAGCTGTTACAAACTCAAGAAGGACCATGTGGTCTCTCTTTTCGTTGG
GATCTTTCGAAAGGGCAGATTGTGTGGACAGGTAATGGTTGTCTGGTAAAAGGACAGGGCCATCGCCAAT
TGGAGTATTTTGTTGATAATGGTCTGCTAGTTGAACGCTTCCATCTTCAATGTTGTGGCGGGTCTTGAAG
TTCACTTTGATTCCATTCTTTTGTTTGTCTGCCATGATGTATACATTGTGTGAGTTATAGTTGTATTCCA
ATTTGTGTCCCAGAATGTTGCCATCTTCCTTGAAGTCAATACCTTTTAACTCGATTCTATTAACAAGGGT
ATCACCTTCAAACTTGACTTCAGCACGTGTCTTGTAGTTGCCGTCATCTTTGAAGAAGATGGTCCTTTCC
TGTACATAACCTTCGGGCATGGCACTCTTGAAAAAGTCATGCCGTTTCATATGATCCGGGTATCTTGAAA
AGCATTGAACACCATAGCACAGAGTAGTGACTAGTGTTGGCCATGGAACAGGCAGTTTGCCAGTAGTGCA
GATGAACTTCAGGGTAAGTTTTCCGTATGTTGCATCACCTTCACCCTCTCCACTGACAGAGAACTTGTGG
CCGTTAACATCACCATCTAATTCAACAAGAATTGGGACAACTCCAGTGAAGAGTTCTTCTCCTTTGCTAG
CCATACTTTATCTAGAGTCGACCCTAGAGCTTTTGTTCCCTTTAGTAGGGTTAA
>GFPseq_ORF
ATGGCTAGCAAAGGAGAAGAACTCTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTTA
ACGGCCACAAGTTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATACGGAAAACTTACCCTGAAGTT
CATCTGCACTACTGGCAAACTGCCTGTTCCATGGCCAACACTAGTCACTACTCTGTGCTATGGTGTTCAA
TGCTTTTCAAGATACCCGGATCATATGAAACGGCATGACTTTTTCAAGAGTGCCATGCCCGAAGGTTATG
TACAGGAAAGGACCATCTTCTTCAAAGATGACGGCAACTACAAGACACGTGCTGAAGTCAAGTTTGAAGG
TGATACCCTTGTTAATAGAATCGAGTTAAAAGGTATTGACTTCAAGGAAGATGGCAACATTCTGGGACAC
AAATTGGAATACAACTATAACTCACACAATGTATACATCATGGCAGACAAACAAAAGAATGGAATCAAAG
TGAACTTCAAGACCCGCCACAACATTGAAGATGGAAGCGTTCAACTAGCAGACCATTATCAACAAAATAC
TCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTCCACACAATCTGCCCTTTCGAAA
GATCCCAACGAAAAGAGAGACCACATGGTCCTTCTTGAGTTTGTAACAGCTGCTGGGATTACACATGGCA
TGGATGAACTGTACAAC
>GFP_CB
MASKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLCYGVQ
CFSRYPDHMKRHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGH
KLEYNYNSHNVYIMADKQKNGIKVNFKTRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSK
DPNEKRDHMVLLEFVTAAGITHGMDELYN

These sequences are the consensus from the chromatograms above.
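
The relationship between the three records above (consensus, ORF on the reverse strand, protein) can be checked with a few lines of code. This is a minimal sketch using the standard genetic code; the function names are made up for illustration, and the short test fragment is copied from the end of the pGreen consensus, where it spans a line break.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate in frame 1; unknown codons become 'X', stop codons '*'."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))

# Fragment from near the 3' end of the consensus (reverse strand of the ORF start)
fragment = "GAGTTCTTCTCCTTTGCTAGCCAT"
orf_start = reverse_complement(fragment)
print(orf_start)             # ATGGCTAGCAAAGGAGAAGAACTC
print(translate(orf_start))  # MASKGEEL
```

The output matches the first 24 bases of GFPseq_ORF and the first eight residues of GFP_CB, which is the check that the ORF lies on the reverse strand of the consensus.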

Alignment of GFP sequences with ClustalW. GFP_CB is the predicted GFP ORF from pGREEN from Carolina Biological; GFP_Genbank is the predicted ORF from AB124780.


It is clear that the GFP that was sequenced is identical to SG11 from the patent, as suggested by Carolina Biological. SG11 has several amino acid changes relative to the original GFP from the Genbank accession, which enhance fluorescence.

The alignment below is between the pGreen consensus sequence above (pGREEN_insert) and that made available by Avery Louie (newGFP) done by ClustalW.