7.1 FASTA and FASTQ formats

High-throughput sequencing reads are usually output from sequencing facilities as text files in a format called “FASTQ” or “fastq”. This format depends on an earlier format called FASTA. The FASTA format was developed as a text-based format to represent nucleotide or protein sequences (see Figure 7.1 for an example).

An example fasta file showing the first part of the PAX6 gene.

FIGURE 7.1: An example fasta file showing the first part of the PAX6 gene.

The first line in a FASTA file usually starts with a “>” (greater-than) symbol. This first line is called the “description line”, and can contain descriptive information about the sequence in the subsequent lines. The description can be the ID or name of the sequence such as gene names. However, very infrequently you may see lines starting with a “;” (semicolon). These lines will be taken as a comment, and can hold additional descriptive information about the sequence in subsequent lines.

An extension of the FASTA format is FASTQ format. This format is designed to handle base quality metrics output from sequencing machines. In this format, both the sequence and quality scores are represented as single ASCII characters. The format uses four lines for each sequence, and these four lines are stacked on top of each other in text files output by sequencing workflows. Each of the 4 lines will represent a read. Figure 7.2 shows those four lines with brief explanations for each line.

FASTQ format and a brief explanation of each line in the format.

FIGURE 7.2: FASTQ format and a brief explanation of each line in the format.

Line 1 begins with the ‘@’ character and is followed by a sequence identifier and an optional description. This line is utilized by the sequencing technology, and usually contains specific information for the technology. It can contain flow cell IDs, lane numbers, and information on read pairs. Line 2 is the sequence letters. Line 3 begins with a ‘+’ character; it marks the end of the sequence and is optionally followed by the same sequence identifier again in line 1. Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. Each letter corresponds to a quality score. Although there might be different definitions of the quality scores, a de facto standard in the field is to use “Phred quality scores”. These scores represent the likelihood of the base being called wrong. Formally, \({\displaystyle Q_{\text{phred}}=-10\log _{\text{10}}e}\), where \(e\) is the probability that the base is called wrong. Since the score is in minus log scale, the higher the score, the more unlikely that the base is called wrong.