squigualiser

Pipeline basic

Here we will go through each step from the raw signals file (slow5) to generating plots.

For this we will use r9.4_450bps.nucleotide DNA dataset. The dataset is at test/data/raw/pipelines/pipeline_0/r9.4_450bps.nucleotide. Note that the script must be executed inside the virtual environment where squigualiser is installed.

Pipeline variables

The following variables must be changed before running the bash script.

Variable name value set for this pipeline Description
SQUIGULATOR depends on user computer path to squigulator (minimum version v0.1.0)
SAMTOOLS depends on user computer path to samtools (minimum version v1.13)
MINIMAP2 depends on user computer path to minimap2 (minimum version v2.24)
F5C depends on user computer path to f5c (minimum version v1.3)
BGZIP depends on user computer path to bgzip (minimum version v1.13)
TABIX depends on user computer path to tabix (minimum version v1.13)
GUPPY depends on user computer path to dir containing guppy executables (minimum version v6.3.7)
BUTTERY_EEL_ENV_PATH depends on user computer path to python env where buttery-eel is installed (minimum version v0.2.2)
REFERENCE depends on user computer path to the reference genome file (please use a human genome if you use the test datasets)
MODEL_TO_USE dna_r9.4.1_450bps_fast_prom.cfg this has to be set to the name of the model used to basecall
CHUNK_SIZE 500 (optional) This is used in basecalling to limit GPU memory.
SIMULATING_PROFILE dna-r9-prom signal profile used by squigulator (-x argument)
RE_REFORM_K 3 kmer_length value used in the second run of reform. For new data this value can only be set after running calculate_offsets step (see pipeline steps) More details can be found at profiles
RE_REFORM_M 2 sig_move_offset value used in the second run of reform. For new data this value can only be set after running calculate_offsets step (see pipeline steps) More details can be found at profiles
PROFILE_TO_DETERMINE_BASE_SHIFT kmer_model_dna_r9.4.1_450bps_6_mer squigualiser uses this value to determine the appropriate base_shift. More details can be found at profiles
SIGNAL_FILE reads.blow5 SLOW5/BLOW5 signal file
READ_ID 8e1a33c4-af69-471c-a115-6428c8bf63df An arbitrary read_id from the SIGNAL_FILE to perform calculate_offsets_read_id step (see pipeline steps)
READ_REGION ${READ_ID}:1-500 The region to simulate and plot in the selected read.
REF_REGION chr16:46,389,459-46,390,388 The region in the reference to simulate and plot. This region cannot be determined without running minimap2 step. User can load the minimap2 aligned sam file to IGV and select a region.
SIM_REGION sim_ref:1-500 (optional) The region to plot in simulated read and the reference.
SIG_SCALE znorm (optional) squigualiser scales the signal. More information at commands

Pipeline steps

Each step listed in the bash script run.sh is discussed below.

Step Description
basecall_eel Buttery-eel is used to basecall and create moves.sam. Sam file is then converted to bam and fastq for later steps.
basecall_fast5 if Buttery-eel is giving errors use guppy directly to do the basecalling.
reform_move_table run reform with -k 1 and -m 0. Optionally, to skip steps 4, 5, and 6 provide the correct –profile argument. For this pipeline lets not skip the steps.
calculate_offsets use reform output to determine best k and m values.
calculate_offsets_read_id Cretae a pdf file with the distributions to visualise the separation (useful for the user to further confirm if the determined k and m values are optimal).
re_reform Run reform again using the determined k and m values.
simulate_read_signal Simulate a read using squigulator.
plot_sim_and_real_signal Plot both simulated and real signals in two separate tracks. Two pileup commands (one for plotting simulated and one for real signal) were added to a file. Note --profile argument is provided for the simulated signal plot. The profile name depends on what kmer model was used to simulate the signal. For the real signal plot --profile argument is not necessary as we have ran reform again with new k and m values to make the base shift 0. Then the file is given as the input to plot_tracks command.
minimap2_align Align the basecalled reads to the reference.
realign Run realign tool to re-align cigar mapping to moves.
simulate_ref_signal simulate part of the reference region using squigulator.
plot_realign_and_sim Plot both simulated reference and real signals mapped to the reference (using realign) in two separate tracks. Two pileup commands (one for plotting simulated and one for real signals) were added to a file. Note --profile argument is provided for the simulated signal plot. The profile name depends on what kmer model was used to simulate the signal. For the real signal plot --profile argument is not necessary as we have ran reform again with new k and m values to make the base shift 0. Then the file is given as the input to plot_tracks command.
f5c_eventalign Peform f5c eventalign to get the eventalignment file.
plot_eventalign_and_sim Plot both simulated reference and real signals mapped to the reference in two separate tracks. Two pileup commands (one for plotting simulated and one for real signals) were added to a file. Note --profile argument is provided for the simulated signal plot. The profile name depends on what kmer model was used to simulate the signal. In this case, also for the real signal plot --profile argument is necessary as we f5c used a kmer model. Since the same kmer model was used in squigulator and f5c, profile name the same. Then the file is given as the input to plot_tracks command.
plot_eventalign_forward_reverse So far all the plots only used forward mapped reads as it is the default. However, in this step forward and reverse mapped reads are plottted using the f5c eventalignment output. Note that --plot_reverse argument is provided in the second command. As usual, profile name is also provided.
plot_eventalign_forward_reverse Similar to step plot_eventalign_forward_reverse but this time using realign data instead of f5c eventalignment. Hence, profile name is not required.

You can use this basic pipeline script as a start and do the necessary changes to support your analysis.

A similar pipeline is available for dna_r10.4.1_e8.2_400bps data at test/data/raw/pipelines/pipeline_0/dna_r10.4.1_e8.2_400bps. A similar pipeline is available for rna_r9.4.1_70bps data at test/data/raw/pipelines/pipeline_0/rna_r9.4.1_70bps.