bash: transform scaffold fasta

I have a fasta file with the following sequences:

>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG

>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG

And I want to split the all sequences in the file that contain multiple N at the site where it occurs and make two sequences out of it.

Expected solution:

>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGC
>contig1
AAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG

>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG

my (inelegant) approach would be this:

perl -pe 's/[N]+/\*/g' $file | perl -pe 's/\*/\n>contig1\n/g'

of course that also replaces the N of the sequence header and creates headers without a sequence. As a plus, it would be nice to number the new 'contigs' from 1 to x in case there are multiple sequences with N. What do you suggest?

2 answers

  • answered 2018-01-13 20:27 georgexsh

    I expaned your perl one-liner a bit:

    cat file.fasta | \
    perl -pe 's/\n//g unless /^>/; s/>/\n>/g;' | \
    perl -pe 's/N+(?{$n++})/\n>contig${n}\n/g unless /^>/'
    

    the first part is to remove newlines between bases, the second part is to replace continuous 'N'.

  • answered 2018-01-13 20:27 zdim

    I'd suggest to use split instead of trying to get a regex just right, and in a script instead of a brittle and crammed "one"-liner.

    use warnings;
    use strict;
    use feature 'say';
    
    my $file = shift @ARGV;
    die "Usage: $0 filename\n" if !$file;  # also check submitted $file
    
    my $content = do {  # or:  my $content = Path::Tiny::path($file)->slurp; 
        local $/; 
        open my $fh, '<', $file or die "Can't open $file: $!"; 
        <$fh>; 
    };
    
    my @f = grep { /\S/ } split /(?<!>)NN+/, $content; 
    say shift @f; 
    
    my $cnt;
    for (@f) {
        say "\n>contig", (++$cnt), ":\n$_";
    }
    

    This slurps the file into $content since NN+ can span multiple lines; Path::Tiny module can make that cleaner. The first element of the obtained array needs no >contig so it is shifted away.

    The negative lookbehind (?<!...) makes the regex in split's separator pattern match NN+ only when not preceded by >, thus protecting (excluding) header lines that may start with that. If headers may contain consecutive N which are not right after > then you need to refine this.