Data Science in Entomology: A Student’s Guide for Tackling Big Data
By Victoria Pickens
Editor’s Note: This post is part of a series contributed by the ESA Student Affairs Committee. See other posts by and for entomology students here at Entomology Today.
Finally, the moment had come. After months of collections and preparation, I was going to analyze my data. After initial analysis, however, my advisors and I figured out that climate data would be an important addition. I imported all the data into Excel, and it all went downhill from there.
Suddenly, functions were struggling to run, and the software kept shutting down more and more often. On top of that, I had multiple files that all needed the same functions performed on them, so a process that would normally take minutes with a single data set quickly became an uphill battle. Then I had to amend all my previous data to include the new climate data.
Thankfully, I had taken a data science course the preceding semester. Using Python, I could write scripts that performed those functions in seconds for all the files at once. What a lifesaver!
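To make that concrete, here is a minimal sketch of what such a batch script might look like. It is not the script I used: the function names, the shared `date` column, and the folder layout are all assumptions for illustration, and it relies only on Python’s standard library. It reads one climate file, then appends the matching climate columns to every CSV file in a folder.

```python
import csv
from pathlib import Path


def load_climate(climate_csv):
    """Map each date string to its climate row (assumes a 'date' column)."""
    with open(climate_csv, newline="") as handle:
        return {row["date"]: row for row in csv.DictReader(handle)}


def add_climate(collection_csv, climate_by_date, out_dir):
    """Write a copy of one collection file with the climate columns appended."""
    # Climate columns to append: everything except the shared 'date' key.
    first = next(iter(climate_by_date.values()), {})
    climate_fields = [name for name in first if name != "date"]

    with open(collection_csv, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or []) + climate_fields
        rows = []
        for row in reader:
            climate = climate_by_date.get(row["date"], {})
            for name in climate_fields:
                row[name] = climate.get(name, "")  # left blank if no match
            rows.append(row)

    out_path = Path(out_dir) / Path(collection_csv).name
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return out_path


def process_folder(in_dir, climate_csv, out_dir):
    """Apply add_climate to every .csv file in in_dir; return the new paths."""
    climate_by_date = load_climate(climate_csv)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    return [
        add_climate(path, climate_by_date, out_dir)
        for path in sorted(Path(in_dir).glob("*.csv"))
    ]
```

Calling something like `process_folder("collections", "climate.csv", "merged")` (hypothetical paths) would write one merged file per input file. The sketch assumes the climate file sits outside the input folder and that the two files share no column names other than `date`.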
The Power of Data
With the advancement of technology in the past few decades, information has become increasingly attainable. Services like DNA sequencing, formerly expensive, have rapidly declined in price and become easier to access and more common. Cheaper and more efficient technology and services have also created an expectation that scientists will obtain more data. This has led to a surge in “big data,” or data sets that are too large or complex for traditional analytical methods. Consequently, big data has arisen as the next challenge that students in various scientific disciplines, including entomology, must learn to tackle.
Data science is a field that has developed along with the surge in technology. It encompasses preparing large data sets, performing analyses, and presenting trends in the data, all of which would be difficult using standard methods. Typically, it is performed using programming languages like R, SAS, and Python. Many people are familiar with these tools for genomic data analysis, but they can also be used to develop predictive models for ecological questions.
These tools are also amazing for developing workflows that perform repetitive functions on large data sets. I have a script that takes a folder of FASTQ files from my Sanger sequencing data and cleans, quality checks, and BLASTs all of my sample sequences for identification. Not only can it perform all of these functions on over 50 samples in seconds, but I can also reuse the script on future sample data.
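As a rough illustration of the cleaning and quality-check steps, the sketch below trims low-quality bases from the ends of each read and keeps only reads that pass a length and mean-quality filter. It is a simplified stand-in, not my actual script: the function names and thresholds are hypothetical, it assumes four-line FASTQ records with Phred+33 quality encoding, and it leaves out the BLAST step, which depends on external tools.

```python
from pathlib import Path


def read_fastq(path):
    """Yield (name, sequence, quality) from a FASTQ file with 4-line records."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:  # end of file (or a trailing blank line)
                break
            sequence = handle.readline().rstrip()
            handle.readline()  # skip the '+' separator line
            quality = handle.readline().rstrip()
            yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def trim_low_quality_ends(sequence, quality, min_q=20, offset=33):
    """Drop bases from both ends until a base scores at least min_q."""
    scores = [ord(ch) - offset for ch in quality]
    start = 0
    while start < len(scores) and scores[start] < min_q:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return sequence[start:end], quality[start:end]


def clean_folder(in_dir, out_fasta, min_q=20, min_len=50):
    """Trim and filter every .fastq file in in_dir; write passing reads as FASTA."""
    kept = 0
    with open(out_fasta, "w") as out:
        for fastq in sorted(Path(in_dir).glob("*.fastq")):
            for name, seq, qual in read_fastq(fastq):
                seq, qual = trim_low_quality_ends(seq, qual, min_q)
                if len(seq) >= min_len and mean_phred(qual) >= min_q:
                    out.write(f">{name}\n{seq}\n")
                    kept += 1
    return kept
```

The resulting FASTA file could then be passed to a BLAST search for identification. The thresholds here (`min_q=20`, `min_len=50`) are placeholders and should be chosen to suit your own data.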
But where do you begin?
My first experience with big data was in my junior year of college. I had collected 16S microbiome data for my undergraduate research project and had no idea what to do with it. I had limited experience with computers or data analysis before then, and I didn’t recognize things like command prompts or supercomputers. My mentor and colleague were also unfamiliar with bioinformatics and data science, so we spent most of our time attempting to learn how to process and analyze the massive amount of data we had received. We attended workshops on R and RStudio to learn the basics, and we met regularly with a bioinformatics group to discuss the problems we were facing as we worked through the data.
Since then, I’ve been able to get a better grasp on handling complex data in graduate school by attending a bioinformatics and data science course, as well as watching online videos and taking an online course in Python. During this process, I’ve learned some key lessons that would be beneficial for others who are just getting started:
Learn a programming language. Looking back, my biggest mistake when beginning to work with big data was not taking the time to properly learn a programming language. As mentioned before, R, SAS, and Python are among the most popular languages. I highly recommend using one of these, as there will be more resources available to help you work through challenges. Websites like Udemy and Coursera offer great online courses, and you’ll find that they gently nudge you into learning data science at the same time.
It’s also important to practice regularly because, as the name suggests, you’re learning a language. I recently took a break from Python for a few months to focus on lab work, and now I struggle to remember how to do the simplest tasks. I find myself wasting time relearning what I’d already learned!
Find someone with experience. It doesn’t need to be a professional; it could even be a fellow student. The point is to have someone who can at least help steer you in the right direction without having to dig for hours online. If you can’t find someone to help you, turn to science-based Q&A websites like ResearchGate or, even better, post on community forums.
Since data science is relatively new and there are constant advancements, community forums are a great online source for scientists around the world to tackle new problems that arise. Many universities are also now forming free computing service groups that help maintain servers and assist university members with their research.
Create a learning group. Similarly, this can be one of the more difficult but worthwhile tasks. By creating a group of fellow students interested in data science, not only can you help each other with questions, but discussing it with a group can also help you retain the information better. Plus, data science is a broad field, and you’ll find that different members of your group will be better at different aspects.
Know your research goals. Data science offers a large variety of analytical procedures, to the point where it’s almost distracting. Too often you can find yourself falling down rabbit holes, exploring aspects of your data you hadn’t thought of before. While this exploration can improve your analysis, it’s also important to stay focused on your research goals.
Before you begin, write down which tasks need to be performed and which questions need to be answered. Beware of straying from the path too often, as it can prolong your data analysis to the point that it can become difficult to publish. Furthermore, it is important to know what you need from your data to identify proper procedures for your analysis, since there are so many options now available.
Jump in! For me, the hardest part of data science was getting started. Having no experience with computers or programming, I felt I would never be capable of writing a script or using a supercomputer without breaking something. Once I began working with other students, talking with experienced data scientists, and taking courses, I quickly realized that everyone has to start somewhere. Plus, you can’t be afraid of failure in data science, because making mistakes is also one of the best ways to learn. So, don’t be afraid to try it and explore all it has to offer along the way!
Victoria Pickens is a graduate research assistant in the Department of Entomology at Kansas State University in Manhattan, Kansas, and the Medical, Urban, and Veterinary Entomology Section Representative to the ESA Student Affairs Committee.
Excellent article. I would add that there are many websites that show solutions to common issues. This is useful when your experienced friend is not available. In choosing a program, consider that SAS might be free for students, but it is expensive after that. The SAS license needs yearly renewal, and this cost has increased greatly over the last few years. R and Python are free and will likely remain so for the foreseeable future.
This is a good article and will solve many problems occurring in data collection and interpretation. I gained a lot of knowledge from this article and appreciate the author too.