SnapGene allows me to import and export "GenBank - SnapGene" format. What is this format and how can I use it?
The "GenBank - SnapGene" format adheres to [GenBank | GenPept] conventions. However, it uses additional qualifiers to encode information used by SnapGene that the [GenBank | GenPept] format does not typically capture, such as:
- the sequence author
- custom map label and alias
- feature directionality, segment ranges, segment colors, segment names, and cleavage arrow positions
- primer name, description, sequence, color, 5' phosphorylation, and date added
You can use the following format specifications in your external workflows to generate "GenBank - SnapGene" format files that can be read by SnapGene.
File Header
For the additional information in qualifiers and other fields to be decoded, the file must clearly advertise that it uses the "GenBank - SnapGene" format, as follows:
1. The LOCUS name must contain Exported.
LOCUS Exported 4373 bp ds-DNA circular SYN 07-MAR-2017
2. The last REFERENCE must include the TITLE Direct Submission as well as a JOURNAL entry that contains SnapGene.
REFERENCE 2 (bases 1 to 4373)
AUTHORS [Listed here is the Sequence Author or a "." character.]
TITLE Direct Submission
JOURNAL Exported Apr 28, 2017 from
SnapGene 3.3.45 http://www.snapgene.com
Note that these requirements are case-sensitive.
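The two header requirements above can be checked programmatically before handing a file to SnapGene. The following is a minimal Python sketch, not SnapGene's own importer logic: the function name advertises_snapgene_format and the set of REFERENCE sub-keywords are assumptions made for illustration, and the check treats the TITLE as an exact match.

```python
# Hypothetical helper; the sub-keyword list is an assumption based on
# common GenBank REFERENCE sub-fields.
REFERENCE_SUBKEYS = {"AUTHORS", "CONSRTM", "TITLE", "JOURNAL", "PUBMED", "REMARK"}


def advertises_snapgene_format(genbank_text: str) -> bool:
    """Return True if the header meets both case-sensitive requirements."""
    # Only the header (everything before FEATURES) is inspected.
    header_lines = genbank_text.split("\nFEATURES", 1)[0].splitlines()

    # Requirement 1: the LOCUS name (second token) contains "Exported".
    locus = next((ln for ln in header_lines if ln.startswith("LOCUS")), "")
    tokens = locus.split()
    if len(tokens) < 2 or "Exported" not in tokens[1]:
        return False

    # Requirement 2: collect the sub-fields of the last REFERENCE block.
    starts = [i for i, ln in enumerate(header_lines) if ln.startswith("REFERENCE")]
    if not starts:
        return False
    fields, key = {}, None
    for ln in header_lines[starts[-1] + 1:]:
        if ln and not ln[0].isspace():
            break  # a new top-level keyword ends the REFERENCE block
        parts = ln.split(None, 1)
        if parts and parts[0] in REFERENCE_SUBKEYS:
            key = parts[0]
            fields[key] = parts[1] if len(parts) > 1 else ""
        elif key is not None:
            fields[key] += " " + ln.strip()  # continuation line

    return (fields.get("TITLE", "").strip() == "Direct Submission"
            and "SnapGene" in fields.get("JOURNAL", ""))
```

Because the comparisons use plain string matching, a JOURNAL entry containing "snapgene" in lowercase would be rejected, which mirrors the case-sensitivity note above.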
Sequence Label
If specified, the custom map label is encoded using the KEYWORDS field:
KEYWORDS Custom Map Label
Sequence Alias
If an alias is required, it is encoded using the last COMMENT field:
COMMENT Alias: pBSG307
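When generating files from an external workflow, the label and alias lines can be produced with a couple of small helpers. This is a sketch under assumptions: the function names are hypothetical, the 12-column keyword padding follows the usual GenBank flat-file layout rather than anything stated here, and the "." placeholder for an unused KEYWORDS field follows the import notes later in this article.

```python
from typing import Optional


def keywords_line(label: Optional[str]) -> str:
    """Build the KEYWORDS line; "." stands for 'no custom map label'."""
    # Assumption: keyword padded to 12 columns, as in standard GenBank files.
    return f"{'KEYWORDS':<12}{label if label else '.'}"


def alias_comment_line(alias: str) -> str:
    """Build the COMMENT line that carries the sequence alias."""
    return f"{'COMMENT':<12}Alias: {alias}"
```

For the alias to be recognized, the line from alias_comment_line would need to be written as the last COMMENT field in the header.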
Sequence Author
If the last REFERENCE has text other than "." in the AUTHORS field, that text will be imported as the Sequence Author. An alternative JOURNAL entry would be:
JOURNAL SnapGene GenBank format
Features
Information about features is encoded using an additional /note qualifier.
Such information is only decoded from the last /note qualifier, if present. Key/value pairs are encoded as key: value. Multiple pairs are separated by semicolons, for example:
/note=color: #ffd281; direction: BOTH
Any prior /note qualifiers hold actual notes about the feature.
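The key: value encoding described above is simple to decode. Here is a minimal Python sketch; parse_note_pairs and split_feature_notes are hypothetical names, and the sketch assumes the qualifier values have already been extracted from the file with their surrounding quotes removed.

```python
from typing import Dict, List, Tuple


def parse_note_pairs(note: str) -> Dict[str, str]:
    """Split 'key: value; key: value' into a dictionary."""
    pairs = {}
    for item in note.split(";"):
        key, sep, value = item.partition(":")
        if sep:  # ignore fragments that contain no "key: value" pair
            pairs[key.strip()] = value.strip()
    return pairs


def split_feature_notes(notes: List[str]) -> Tuple[List[str], Dict[str, str]]:
    """Treat the last /note as encoded data and any earlier ones as plain notes."""
    if not notes:
        return [], {}
    return notes[:-1], parse_note_pairs(notes[-1])
```

For example, split_feature_notes(["binds ampicillin", "color: #ffd281; direction: BOTH"]) keeps the first entry as a plain note and decodes the second into a dictionary with the keys color and direction.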
Directionality
Note that directionality determines how the direction of a feature is depicted in SnapGene.
The orientation of a feature, if on the reverse strand, is defined using the complement qualifier:
CDS complement(3518..4378)
The GenBank format provides ambiguous information regarding directionality. There is no way to encode bidirectionality, and directionality can only be implied using the /translation and /direction qualifiers. As a result, it is impossible to detect the directionality of non-translated forward directional features. We encode the directionality as:
direction: [RIGHT | LEFT | BOTH]
Directionality is omitted if the feature is nondirectional, or if the directionality is implicit because the feature is translated or has a /direction qualifier.
Feature Color
The color of a single-segment feature is encoded as color: #RRGGBB using the standard hexadecimal format for RGB colors.
/note="color: #FF0000"
Line appearance is encoded as #------.
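To illustrate how the color and direction keys combine for a single-segment feature, the sketch below assembles the final /note qualifier in Python. The function build_feature_note and its parameters are hypothetical; in particular, the implicit_direction flag is an assumed way of expressing "the feature is translated or has a /direction qualifier", and the color-then-direction ordering simply follows the examples in this article.

```python
import re
from typing import Optional

VALID_DIRECTIONS = ("RIGHT", "LEFT", "BOTH")


def build_feature_note(color: Optional[str] = None,
                       direction: Optional[str] = None,
                       implicit_direction: bool = False) -> Optional[str]:
    """Return the final /note qualifier line, or None if there is nothing to encode."""
    parts = []
    if color is not None:
        if not re.fullmatch(r"#[0-9A-Fa-f]{6}", color):
            raise ValueError(f"expected #RRGGBB, got {color!r}")
        parts.append(f"color: {color}")
    # Direction is omitted for nondirectional features (direction=None)
    # and when the directionality is already implicit.
    if direction is not None and not implicit_direction:
        if direction not in VALID_DIRECTIONS:
            raise ValueError(f"unknown direction {direction!r}")
        parts.append(f"direction: {direction}")
    if not parts:
        return None
    return '/note="' + "; ".join(parts) + '"'
```

Calling build_feature_note("#00FF00", "LEFT") yields the same qualifier text used for the reverse directional feature in the example file further down.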
Multi-Segment Features
In order to encode the name, color, and range for segments in a multi-segment feature, a multi-line note qualifier is employed whose value is enclosed in quotes.
/note="This FEATURE has N segments:
1: # .. # / #ff0000 / First Segment Name
...
N: # .. # / #0000ff / Last Segment Name"
For example:
/note="This bidirectional feature has 2 segments:
1: 1001 .. 1298 / #ff0000 / One
2: 1299 .. 1596 / #00ff00 / Two"
The first line is used to indicate the number of segments as well as the feature directionality and can take the following forms:
This feature has # segments:
This forward directional feature has # segments:
This reverse directional feature has # segments:
This bidirectional feature has # segments:
The first variant is used for a non-directional feature, or for a feature in which the directionality is implicit because the feature is translated or has a /direction qualifier.
Subsequent lines are used to encode information about individual non-gap segments. Each segment is encoded using the following format, where SEGMENT_NAME and the preceding forward slash are only included for named segments.
SEGMENT_NUMBER: FIRST_BASE .. LAST_BASE / #RRGGBB / SEGMENT_NAME
For example:
1: 1000 .. 2000 / #ff0000 / Red Segment
For each segment, the segment number is followed by a colon, and the range, color, and, if specified, name are separated by forward slashes (/). Segments are always listed in order, although gaps between segment ranges may be present.
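The multi-segment note can be decoded with two regular expressions, one for the header line and one for each segment line. The following Python sketch is an illustration only: the names Segment and parse_segment_note are hypothetical, the note text is assumed to be already unquoted, and lines that do not look like segments (such as a trailing cleavage line) are simply skipped.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    number: int
    first_base: int
    last_base: int
    color: str
    name: Optional[str]  # None for unnamed segments


HEADER_RE = re.compile(
    r"This (?:(forward directional|reverse directional|bidirectional) )?"
    r"feature has (\d+) segments:")
SEGMENT_RE = re.compile(
    r"(\d+):\s*(\d+)\s*\.\.\s*(\d+)\s*/\s*(#[0-9A-Fa-f]{6})(?:\s*/\s*(.+))?")


def parse_segment_note(note: str) -> Tuple[Optional[str], List[Segment]]:
    """Return (directionality phrase or None, list of segments)."""
    lines = [ln.strip() for ln in note.strip().splitlines() if ln.strip()]
    header = HEADER_RE.fullmatch(lines[0]) if lines else None
    if header is None:
        raise ValueError("note does not start with a segment header line")
    segments = []
    for ln in lines[1:]:
        m = SEGMENT_RE.fullmatch(ln)
        if m is None:
            continue  # not a segment line, e.g. a cleavage site line
        number, first, last, color, name = m.groups()
        segments.append(Segment(int(number), int(first), int(last), color,
                                name.strip() if name else None))
    if len(segments) != int(header.group(2)):
        raise ValueError("segment count does not match the header line")
    return header.group(1), segments
```

The first element of the result is None for the plain "This feature has # segments:" variant, which covers both nondirectional features and features whose directionality is implicit.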
Cleavage Arrows
If a cleavage arrow is present between two segments or at an end of the feature, this information is encoded at the end of the last /note qualifier, using the following format:
Cleavage site after base [BASE NUMBER]
or,
Cleavage sites after bases [BASE NUMBER, BASE NUMBER, ...]
For example, the last line of the last /note qualifier might read:
Cleavage sites after bases 5, 16, 200
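Both the singular and plural forms of the cleavage line can be handled with one pattern. The Python sketch below reads and writes that line; parse_cleavage_sites and format_cleavage_sites are hypothetical names, and the note text is assumed to be already unquoted.

```python
import re
from typing import List

CLEAVAGE_RE = re.compile(r"Cleavage sites? after bases? ([\d,\s]+)")


def parse_cleavage_sites(note: str) -> List[int]:
    """Read base numbers from the last line of the last /note value."""
    lines = note.strip().splitlines()
    if not lines:
        return []
    match = CLEAVAGE_RE.fullmatch(lines[-1].strip())
    if match is None:
        return []  # no cleavage arrows encoded
    return [int(tok) for tok in match.group(1).split(",") if tok.strip()]


def format_cleavage_sites(bases: List[int]) -> str:
    """Write the cleavage line, choosing the singular or plural wording."""
    if len(bases) == 1:
        return f"Cleavage site after base {bases[0]}"
    return "Cleavage sites after bases " + ", ".join(str(b) for b in bases)
```

When writing a note, the output of format_cleavage_sites would be appended as the final line, after any key: value pairs or segment lines.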
Primers
Primers are encoded as primer_bind features.
Any primer_bind features that include sequence data are imported not as primer_bind features, but rather as SnapGene primers. If there are multiple binding sites for a primer, only one copy of the primer is generated during import.
Name
The primer name is encoded using the /label qualifier, e.g.
/label=Primer Name
Description
If present, the primer description is recorded using a /note qualifier, e.g.
/note="This is a primers description."
Other Attributes
The final /note qualifier uses keys to specify primer color, primer sequence, phosphorylation (if present), and, if known, the date the primer was added to the sequence.
Keys are separated from values using Key: Value format. Multiple Key / Value pairs are separated by semicolons.
/note="color: orange;
sequence: GCTCATGCCATTGGCGTTAACTCTGCTTCTTGGGCTCCAGCTACC;
added: 2021-04-05;
5' phosphorylated"
Color
The color values must be lowercase:
[ black | red | orange | green | blue | purple | gray ]
Sequence
The primer sequence is case sensitive and can include a mixture of upper and lower case characters.
Date
The UTC date the primer was added to the file, if known, is encoded with:
YYYY-MM-DD
5' Phosphorylation
If the primer is 5' phosphorylated, this is included at the end of the terminal note qualifier:
5' phosphorylated
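Putting the primer attributes together, the sketch below builds the final /note qualifier for a primer_bind feature in Python. The function build_primer_note and its parameters are hypothetical, and the output is a single unwrapped line; wrapping long values across lines, as GenBank files normally do, is left out of this sketch.

```python
from datetime import date
from typing import Optional

PRIMER_COLORS = {"black", "red", "orange", "green", "blue", "purple", "gray"}


def build_primer_note(sequence: str, color: str,
                      added: Optional[date] = None,
                      phosphorylated: bool = False) -> str:
    """Return the final /note qualifier for a primer, on one line."""
    if color not in PRIMER_COLORS:  # color names must be lowercase
        raise ValueError(f"unsupported primer color {color!r}")
    # The sequence is written as given, so mixed case is preserved.
    parts = [f"color: {color}", f"sequence: {sequence}"]
    if added is not None:
        parts.append(f"added: {added.isoformat()}")  # YYYY-MM-DD
    if phosphorylated:
        parts.append("5' phosphorylated")  # always last
    return '/note="' + "; ".join(parts) + '"'
```

The caller is assumed to pass the UTC date on which the primer was added; the sketch does no time-zone conversion of its own.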
Example
LOCUS Exported 2894 bp DNA linear UNA 05-APR-2021
KEYWORDS Custom Map Label
REFERENCE 1 (bases 1 to 2894)
AUTHORS .
TITLE Direct Submission
JOURNAL Exported Monday, Apr 5, 2021 from SnapGene 5.3.0
https://www.snapgene.com
COMMENT Alias: This is an example of an alias
FEATURES Location/Qualifiers
misc_feature 740..1000
/label=Reverse Directional Green Feature
/note="color: #00FF00; direction: LEFT"
misc_feature 1001..1894
/label=Simple Name
/note="This bidirectional feature has 3 segments:
1: 1001 .. 1298 / #ff0000 / First Named Segment
2: 1299 .. 1596 / #00ff00
3: 1597 .. 1894 / #0000ff / Last Named Segment
Cleavage site after base 1800"
primer_bind 1427..1471
/label=FOR
/note="Here is the forward primers description."
/note="color: orange; sequence:
GCTCATGCCATTGGCGTTAACTCTGCTTCTTGGGCTCCAGCTACC; added:
2021-04-05"
primer_bind complement(1649..1676)
/label=M13rev
/note="standard sequencing primer"
/note="color: black; sequence:
aaacactGGCCAAATAagaacgtagaag; added: 2021-04-05;
5' phosphorylated"
Importing
Information at the top of the [GenBank | GenPept] file is imported into the Description Panel. Most of the conversion follows an obvious path, but the following should be noted:
- The DEFINITION is imported into the Description box.
- The KEYWORDS field is normally not used, in which case it is populated with a "." character. If any other text is in the KEYWORDS field, it is recognized upon import as a sequence label for the map.
- A Natural DNA sequence has a three-letter code other than SYN in the LOCUS line. The SOURCE is imported as the Source (called the Source Organism prior to SnapGene version 4.1), and the Sequence Class is imported from the three-letter code in the LOCUS line.
- A Synthetic DNA sequence has the three-letter code SYN in the LOCUS line. If present, the /lab_host qualifier in the source feature is imported as the Laboratory Host (called the Laboratory Host Organism prior to SnapGene version 4.1).
- In the FEATURES section, the contents of the first /label qualifier are used as the default choice for the feature name, for example:
/label=AmpR
Note that prior to SnapGene 3.3.4, the feature name was typically encoded in the first /note qualifier, and that format is still recognized by the importer in the absence of a /label qualifier.
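The feature-name rule above (first /label, otherwise first /note) can be mirrored in an external workflow. This Python sketch is an illustration of that fallback, not SnapGene's importer code; choose_feature_name is a hypothetical name, and qualifiers are assumed to arrive as (key, value) pairs in file order.

```python
from typing import Optional, Sequence, Tuple


def choose_feature_name(qualifiers: Sequence[Tuple[str, str]]) -> Optional[str]:
    """Pick the default feature name from ordered (key, value) qualifiers."""
    # Preferred: the contents of the first /label qualifier.
    for key, value in qualifiers:
        if key == "label":
            return value
    # Older files: fall back to the first /note qualifier.
    for key, value in qualifiers:
        if key == "note":
            return value
    return None
```

For instance, choose_feature_name([("note", "bla gene"), ("label", "AmpR")]) returns "AmpR", because a /label takes precedence regardless of where the /note appears.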