4. sORF assembly, (non) splice-aware
The sORF assembly screens for in-frame stop codons starting from all identified TIS positions.
Either no annotation is taken into account (non splice-aware)
or all splice sites and their corresponding rearranged mRNA transcripts are mapped (splice-aware).
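
The non splice-aware scan described above can be sketched in Python as follows. This is only an illustration, not the pipeline's own code: the function name `find_orf`, the 0-based half-open coordinates, the forward-strand-only handling and the use of the standard stop codons (TAA, TAG, TGA) are all assumptions made for this example.

```python
# Hypothetical sketch: scan from a TIS for the first in-frame stop codon.
# Assumes forward strand, 0-based coordinates and the standard stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orf(seq, tis):
    """Return (start, end) of the ORF that begins at `tis`.

    `end` is exclusive and includes the stop codon.
    Returns None when no in-frame stop codon is found.
    """
    seq = seq.upper()
    # Step codon by codon so that only in-frame triplets are tested.
    for pos in range(tis, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return tis, pos + 3
    return None


if __name__ == "__main__":
    genomic = "CCATGAAATAGGG"
    # The out-of-frame "TGA" at position 3 is ignored.
    print(find_orf(genomic, 2))
```

In a splice-aware setting the same scan would presumably run over the exon-concatenated transcript sequence instead of the raw genomic sequence.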
In addition to the genomic coordinates, the
mass of the resulting peptide, and the DNA/AA sequences, a number of other
characteristics are calculated to enable downstream evaluation of the identified
sORF sequence: annotation (based on TIS location in 5'UTR, exonic, intronic, 3'UTR,
ncRNA or intergenic regions), % overlap with an annotated exon region, nearest gene (for intergenic sORFs)...
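
As an illustration of one of these characteristics, the percentage of overlap with an annotated exon region could be computed along the following lines. The function name `percent_exon_overlap`, the half-open interval convention and the choice of the sORF length as denominator are assumptions for this sketch; this section does not state how the pipeline defines them.

```python
# Hypothetical sketch: percentage of a sORF covered by one annotated exon.
# Intervals are (start, end) with end exclusive; this convention is assumed.
def percent_exon_overlap(sorf, exon):
    """Return the share of the sORF (in %) that lies inside the exon."""
    sorf_start, sorf_end = sorf
    exon_start, exon_end = exon
    length = sorf_end - sorf_start
    if length <= 0:
        raise ValueError("sORF interval must have positive length")
    # Overlap is the distance between the larger start and the smaller end,
    # clipped at zero when the intervals do not intersect.
    overlap = max(0, min(sorf_end, exon_end) - max(sorf_start, exon_start))
    return 100.0 * overlap / length


if __name__ == "__main__":
    print(percent_exon_overlap((100, 200), (50, 150)))
```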
For sORFs with multiple possible Ensembl annotations (e.g. protein-coding/lincRNA), an annotation rank list is constructed
and the sORF is attributed the highest-ranked annotation.
Annotation rank in descending order: 'protein_coding', 'nonsense_mediated_decay', 'non_stop_decay', 'lincRNA', 'antisense', 'sense_intronic', 'sense_overlapping', '3prime_overlapping_ncrna', 'macro_lncRNA', 'processed_transcript', 'retained_intron', 'processed_pseudogene', 'unprocessed_pseudogene', 'transcribed_unprocessed_pseudogene', 'transcribed_processed_pseudogene', 'unitary_pseudogene', 'polymorphic_pseudogene', 'pseudogene', 'transcribed_unitary_pseudogene', 'translated_unprocessed_pseudogene', 'TEC', 'NA', 'nohit'.
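
Selecting the highest-ranked annotation from this list can be sketched as below. The names `ANNOTATION_RANK`, `RANK_INDEX` and `best_annotation` are hypothetical, and ranking an unlisted biotype last is an assumption of this example, not documented pipeline behaviour. Note that every list entry is followed by a comma: in Python, two adjacent string literals without a comma are silently concatenated into one.

```python
# Hypothetical sketch: pick the highest-ranked annotation for a sORF.
# The order is taken from the rank list above (highest rank first).
ANNOTATION_RANK = [
    "protein_coding", "nonsense_mediated_decay", "non_stop_decay",
    "lincRNA", "antisense", "sense_intronic", "sense_overlapping",
    "3prime_overlapping_ncrna", "macro_lncRNA", "processed_transcript",
    "retained_intron", "processed_pseudogene", "unprocessed_pseudogene",
    "transcribed_unprocessed_pseudogene", "transcribed_processed_pseudogene",
    "unitary_pseudogene", "polymorphic_pseudogene", "pseudogene",
    "transcribed_unitary_pseudogene", "translated_unprocessed_pseudogene",
    "TEC", "NA", "nohit",
]

# Map each annotation to its position so lookups are O(1).
RANK_INDEX = {name: i for i, name in enumerate(ANNOTATION_RANK)}


def best_annotation(candidates):
    """Return the candidate that appears earliest in ANNOTATION_RANK.

    Annotations missing from the list are ranked last (an assumption).
    """
    return min(candidates, key=lambda a: RANK_INDEX.get(a, len(ANNOTATION_RANK)))


if __name__ == "__main__":
    print(best_annotation(["lincRNA", "protein_coding"]))
```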