RNA-seq 数据处理（一）HISAT2 序列比对

2023-10-18 23:20 浏览:291 搜索引擎搜索“手机财发网”
温馨提示：为防找不到此信息，请务必收藏信息以备急用！联系我时，请说明是在手机财发网看到的信息，谢谢。

RNA- seq 的数据处理主要分为以下几个过程：

1. 测序数据质控，以及参考基因组和注释文件下载；

2. 序列比对，将测序得到的短 reads 序列往参考基因组上 mapping；

3. 差异表达基因鉴定。

更详细的内容见软件官网

HISAT2: graph-based alignment of next generation sequencing reads to a population of genomes

一、建索引

下载参考基因组和相应的注释文件：

wget ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

mv Homo_sapiens.GRCh38.dna.primary_assembly.fa genome.fa

wget ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz $ gzip -d Homo_sapiens.GRCh38.84.gtf.gz$ mv Homo_sapiens.GRCh38.84.gtf genome.gtf

make exon, splicesite file:

hisat2_extract_splice_sites.py genome.gtf > genome.ss

hisat2_extract_exons.py genome.gtf > genome.exon

Building indexes

hisat2-build -p 16 --exon genome.exon --ss genome.ss genome.fa genome_tran

# -p 线程数；

二、序列比对

hisat2 [options]* -x <hisat2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <hit>] --dta

-x    # The basename of the index for the reference genome.

-1    # Comma-separated list of files containing mate 1s (filename usually includes _1), e.g. -1 flyA_1.fq,flyB_1.fq.

-2 # Comma-separated list of files containing mate 2s (filename usually includes _2), e.g. -2 flyA_2.fq,flyB_2.fq.

-S    # File to write SAM alignments to.

--dta # Report alignments tailored for transcript assemblers including StringTie. With this option, HISAT2 requires longer anchor lengths for de novo discovery of splice sites. This leads to fewer alignments with short-anchors, which helps transcript assemblers improve significantly in computation and memory usage.

上述 4 个参数是基本参数，如果输入文件是 FASTQ 文件，并按第一步建了索引文件，即可直接进行比对。

三、结果文件解读

hisat2 -p 8 -x $RNA_REF_INDEX -1 BB.read1.fastq.gz -2 BB.read2.fastq.gz -S ./HBR_Rep3.sam

生成的 sam 文件格式如下：

第1列：reads名称；

第2列：Flag标签；Flag标签是二进制数字之和，不同数字代表了不同的意义。Sum of all applicable flags.

1: The read is one of a pair

2: The alignment is one end of a proper paired-end alignment

4: The read has no reported alignments

8: The read is one of a pair and has no reported alignments

16: The alignment is to the reverse reference strand

32: The other mate in the paired-end alignment is aligned to the reverse reference strand

64: The read is mate 1 in a pair

128: The read is mate 2 in a pair

第3列：比对到的染色体信息；

第4列：比对到参考基因组物理位置；

第5列：比对质量值（0-60）；

第6列：CIAGR（记录插入、缺失等）；CIAGR中包含的是比对结果信息，表明了一条reads所有碱基的比对情况。比如CIGAR = 150M表示150bp的reads都比对到参考基因组上；

第7列：配对reads比对到的染色体，=表示相同；

第8列：配对reads比对到的染色体物理位置；

第9列：文库插入序列大小；

第10列：Reads序列；

第11列：质量值。

第12列：Optional fields.