For a final-year biomedical science student, the dissertation project can be daunting, and knowing where to start can be a task in itself. Bioinformatics can be a great starting point, as promising results in the data can pave the way for wet-lab experiments. But what if you have no current knowledge of bioinformatics? Where can you find the information you need to get started? I have written this article as a guide for beginners in bioinformatics, taking you through my personal journey, and through YIBS I am able to share my experience.
Bioinformatics is an amalgamation of biology and computer science techniques. It allows scientists and researchers to better analyse, interpret, visualise, and integrate large datasets, leading to new scientific discoveries. It has become increasingly widely used due to an influx of data from continuously emerging laboratory techniques. A recent application used a network-based analysis to investigate the molecular response of immune and cancer cell lines treated with L-pampo. Using a clustering algorithm named Weighted Gene Co-expression Network Analysis (WGCNA), groups of genes involved in the same biological pathways were identified, allowing key inferences to be made regarding their molecular functions. Whether it be analysing RNA sequencing data to look at differentially expressed genes (DEGs) in cancer, or molecular modelling and drug design for therapy, the possibilities are seemingly endless.
Be Prepared…
Starting can be difficult, so the first goal should be to know the question you want to answer. Start by reading primary research papers that interest you and look for unanswered questions. These may be stated explicitly or may require a bit of thought to find the gaps.
In the case of my project, I was interested in investigating regulatory regions of the human genome, particularly in acute myeloid leukaemia (AML). My project was built upon one research question: are there regulatory elements in AML that act as both silencers and enhancers? I set out to answer this question with a multi-omics approach using publicly available raw ChIP-Seq and RNA-Seq data.
Common sequencing methods that require a bioinformatics analysis include DNA and RNA sequencing (DNA-Seq and RNA-Seq), assay for transposase-accessible chromatin sequencing (ATAC-Seq) and chromatin immunoprecipitation sequencing (ChIP-Seq). Being aware of how the technique is carried out in the lab can help you visualise your analysis and troubleshoot when things go wrong.
A key starting point is knowing how to navigate online databases to retrieve data. There is an abundance of online databases containing data from cell lines and primary samples in various diseases and cancers. Examples include the Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA) programme, the ENCODE database and many more. These databases contain large amounts of sequencing data (such as RNA-Seq, ChIP-Seq and ATAC-Seq), which can be helpful if you are not using data that you collect yourself. Deposited data from samples are given a Sequence Read Archive (SRA) accession number. To gather all your data in one place, you will need the SRA accession number for each sample. Alternatively, if you have the resources available, you can collect your own data in the lab first.
Taking the time to thoroughly plan your research can help clearly outline your aims and objectives. Tip: check that you have enough data and replicates available for the cell line or primary samples you need for your research.
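To make the tip above concrete, here is a minimal Python sketch of how you might sanity-check a list of samples before downloading anything. It is only an illustration: the accession numbers, condition labels, function names and the minimum of two replicates are all made up for the example, and the right minimum for your project may well be higher.

```python
import re
from collections import Counter

# SRA run accessions start with SRR, ERR or DRR, followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")

# Hypothetical sample sheet: these accessions and labels are invented.
SAMPLES = [
    {"run": "SRR0000001", "condition": "AML", "assay": "RNA-Seq"},
    {"run": "SRR0000002", "condition": "AML", "assay": "RNA-Seq"},
    {"run": "SRR0000003", "condition": "control", "assay": "RNA-Seq"},
    {"run": "GSM0000004", "condition": "AML", "assay": "ChIP-Seq"},  # GEO sample ID, not a run
]


def invalid_accessions(samples):
    """Return the IDs that do not look like SRA run accessions."""
    return [s["run"] for s in samples if not RUN_ACCESSION.match(s["run"])]


def groups_lacking_replicates(samples, minimum=2):
    """Return (assay, condition) groups with fewer than `minimum` samples."""
    counts = Counter((s["assay"], s["condition"]) for s in samples)
    return {group: n for group, n in counts.items() if n < minimum}


print("Check these IDs:", invalid_accessions(SAMPLES))
print("Too few replicates:", groups_lacking_replicates(SAMPLES))
```

Running a check like this early can show you that a group is short of replicates while there is still time to look for more data.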
Learning the skills!
The truth is that at the start you may have few or no bioinformatics skills and simply have the desire to learn. Regardless of your skillset, anyone can learn. Previous knowledge of programming languages such as R and Python is useful, but this shouldn’t deter you if you haven’t built these skills yet. Another useful skill is being able to navigate command line tools, which are often used in bioinformatics. There are many online courses available that take you through the basics of various programming languages, which can be a good starting point. But if anything, start with R!
At the beginning of my journey, I had taken only one bioinformatics module, in my second year, which covered the basics of R and various statistical analyses used for different data types. This provided me with a building block and later in my project helped with the downstream analysis. However, when using raw data, you must first run quality control, cleaning, and alignment of the data to the reference genome, all of which required some research and learning to achieve.
Using the Linux command line was a huge learning curve. I didn’t recognise how much of my project would involve cultivating this skill or how useful it was in bioinformatics. By joining a coding club at my university, I got to grips with using bash commands and syntax to create folders and files, copy and delete files, and edit files (which later became useful for writing scripts). The rest I learned along the way through searching online forums, YouTube videos, and asking classmates.
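If you are more comfortable in Python than in bash to begin with, the same everyday housekeeping can be sketched there too. The example below is a rough illustration rather than how I set up my own project: the folder and file names are invented, and everything happens inside a temporary folder so nothing on your computer is touched. The comments note the bash command each step roughly corresponds to.

```python
import shutil
import tempfile
from pathlib import Path


def set_up_project(root):
    """Create a simple (hypothetical) project layout and return the folders."""
    folders = {name: Path(root) / name for name in ("raw_data", "scripts", "results")}
    for folder in folders.values():
        folder.mkdir(parents=True, exist_ok=True)  # roughly: mkdir -p
    return folders


# Work in a throwaway folder that is deleted when the block ends.
with tempfile.TemporaryDirectory() as tmp:
    folders = set_up_project(tmp)

    notes = folders["scripts"] / "notes.txt"
    notes.write_text("first script ideas\n")  # create a file with some text

    backup = folders["results"] / "notes_backup.txt"
    shutil.copy(notes, backup)  # roughly: cp
    notes.unlink()              # roughly: rm

    # List the files that are left anywhere in the project.
    remaining = sorted(p.name for p in Path(tmp).rglob("*") if p.is_file())

print(remaining)
```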
In bioinformatics, there are some common tools and packages, and knowing which ones you will need to complete your aims and objectives is essential. By reading the materials and methods sections of published bioinformatics-based research, you can figure out the packages and tools you need for different tasks. Common tools for RNA-Seq analysis include FastQC, Trim Galore, Cutadapt, STAR, RSEM and Bowtie2, but there are many more that could be better suited to your data. Most tools and packages come with a manual describing the different parameters that may be set to ensure the tool is working optimally to produce high-quality data for downstream analysis. For example, if you are doing a ChIP-Seq analysis, you should ask yourself questions such as: what does my data look like? Am I analysing histone marks or transcription factors? These are important things to consider, as the regions enriched for histone marks are often broader than those for transcription factors, and this should be accounted for.
On the command line you can easily run a tool on one sample at a time, but this takes time if you have a lot of samples, and for a dissertation time is limited. Writing a script can therefore make the process more efficient. Script writing goes hand in hand with programming and was probably the steepest learning curve for me as a beginner. Start by using online forums and even AI where possible to help you get started, but always remember to run the script on one sample first to check it works.
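As a rough illustration of the "one sample first" advice, here is a small Python sketch that builds a FastQC command for each sample and, by default, only prints what it would run. The file names, folder names and function names are invented for the example, and it assumes FastQC is installed and that the output folder already exists; it is not the script I used for my dissertation.

```python
import subprocess
from pathlib import Path


def build_fastqc_commands(fastq_files, out_dir):
    """Build one FastQC command per file; -o sets the output folder."""
    return [["fastqc", str(f), "-o", str(out_dir)] for f in fastq_files]


def run_commands(commands, limit=None, dry_run=True):
    """Run (or just print) the first `limit` commands and return them."""
    selected = commands if limit is None else commands[:limit]
    for cmd in selected:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True stops the script if the tool reports an error.
            subprocess.run(cmd, check=True)
    return selected


# Hypothetical file names, for illustration only.
fastq_files = [Path("raw_data") / f"sample{i}.fastq.gz" for i in (1, 2, 3)]
commands = build_fastqc_commands(fastq_files, Path("results") / "fastqc")

# Step 1: look at the command for a single sample without running anything.
first = run_commands(commands, limit=1)
```

Once the printed command looks right, calling run_commands(commands, limit=1, dry_run=False) would run that one sample for real, and removing the limit would run them all.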
Troubleshooting
Troubleshooting as a beginner with no prior experience can be a challenge but, as previously mentioned, knowing the steps of the sequencing technique can help. However, the problem may not lie in the data itself but rather in how a tool is being used. Don’t be afraid to ask your supervisor for help or refer back to the tool’s manual. It may also be helpful to get to know other students doing similar projects so you can collectively save time on errors. Errors can seem very daunting at first, but as you get more practice solving them, fixing them becomes second nature.
Conclusion
If you have made it to the end of the article, then I hope that it has given you some confidence to take the leap and get started on your bioinformatics journey no matter your current skill set. Upon completion of my dissertation, I realised that bioinformatics is not so scary and is actually a wonderful and exciting way to gain novel insights into biomedicine. I still have lots to learn and I hope to share my knowledge with you through YIBS. If you enjoyed reading this and would like to read more about my dissertation you can find that here. If you would like to learn more about RNA, DNA, ChIP and ATAC sequencing methods from the lab to the computer just follow the links below. Good luck in your future endeavours!