

C++ practice: trimming the first 5 bp of reads

Dr.X的基因空间

2024-01-25

Overview: C++ practice

C++ practice: trimming the first 5 bp from a FASTQ file

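Foreword
A while ago I tried writing a C++ program for the sake of speed. So as not to forget C++, I plan to keep writing small C++ programs from time to time. Bioinformatics involves a great deal of FASTQ processing, and trimming reads in particular: filtering low-quality sequence and removing adapters both come down to cutting reads. So I tried writing a C++ program that removes the first 5 bp from every read in a FASTQ file.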

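Advantages of C++ for processing FASTQ files

One advantage of C++ is that it can work with raw bytes directly. FASTQ data is plain text, but it is usually stored gzip-compressed, and a compressed file is binary data. A C++ program can decompress it in memory as it reads, through a library such as zlib, without first writing an uncompressed copy to disk. Opening a file in binary mode does not decompress anything by itself; it only guarantees that the bytes arrive unmodified. In addition, C++ gives explicit control over memory allocation and buffering, which helps keep memory use bounded when streaming large files and can improve performance.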

Trimming the first 5 bp of the reads in a FASTQ file

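#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <zlib.h>  // assumes zlib is installed; build with e.g. g++ trim.cpp -lz

// Read one line (without its line ending) from a gzip or plain-text file.
// Returns false once the end of the file has been reached.
static bool read_line(gzFile in, std::string& line) {
    line.clear();
    char buf[4096];
    while (gzgets(in, buf, static_cast<int>(sizeof(buf))) != nullptr) {
        line += buf;
        if (!line.empty() && line.back() == '\n') {
            line.pop_back();
            if (!line.empty() && line.back() == '\r') line.pop_back();
            return true;
        }
    }
    return !line.empty();  // last line without a trailing newline
}

int main() {
    // Input FASTQ file (gzip-compressed or plain text)
    std::string input_file;
    std::cout << "Input FASTQ file (.gz or plain text): ";
    std::cin >> input_file;

    // Output file name (written gzip-compressed)
    std::string output_file;
    std::cout << "Output FASTQ file (.gz): ";
    std::cin >> output_file;

    // Number of bases to remove from the start of each read
    int num_bp_to_trim = 5;
    std::cout << "Number of bp to trim: ";
    if (!(std::cin >> num_bp_to_trim) || num_bp_to_trim < 0) {
        std::cerr << "Error: the trim length must be a non-negative integer" << std::endl;
        return 1;
    }
    const std::size_t trim = static_cast<std::size_t>(num_bp_to_trim);

    // Open the input and output files; gzopen also reads uncompressed input
    gzFile infile = gzopen(input_file.c_str(), "rb");
    gzFile outfile = gzopen(output_file.c_str(), "wb");
    if (!infile || !outfile) {
        std::cerr << "Error: cannot open the input or output file" << std::endl;
        if (infile) gzclose(infile);
        if (outfile) gzclose(outfile);
        return 1;
    }

    // A FASTQ record has four lines: header, sequence, '+' separator, quality.
    // Track the position inside the record instead of testing for '@', because
    // '@' is also a valid quality character.
    std::string line;
    std::size_t line_no = 0;
    while (read_line(infile, line)) {
        const std::size_t field = line_no++ % 4;
        if (field == 1 || field == 3) {
            // Trim sequence and quality by the same amount so their lengths match
            line.erase(0, std::min(trim, line.size()));
        }
        line += '\n';
        gzwrite(outfile, line.data(), static_cast<unsigned>(line.size()));
    }

    // Close both files; closing the output flushes the compressed data
    gzclose(infile);
    gzclose(outfile);

    return 0;
}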

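The program first reads the input FASTQ file name, the output file name, and the number of bp to trim from the user. It then opens the input and output files and reads the input line by line. Gzip-compressed input has to be decompressed as it is read, which requires a compression library such as zlib; the standard file streams cannot do this on their own. For each read, the program keeps the sequence from the 6th base onward (that is, everything after the 5th base, when trimming 5 bp) and writes it to the output file. The quality line must be trimmed by the same amount so that sequence and quality stay the same length, while the header line and the "+" separator line are written unchanged.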

