Putting the Field in Fieldwork: Testing the Quality of Commercial Data

Every day, independent consultants and scouts comb through farm fields, swinging sampling nets to monitor pests. A new study of the data generated by commercial sampling suggests that repeatability is sufficient for such data to be an important, useful resource for agricultural entomologists. (Photo by uwextension ct via Flickr, CC BY-NC-ND 2.0)
By Melissa Mayer
Every day, independent consultants and scouts comb through farm fields, swinging sampling nets to monitor pests. Jay Rosenheim, Ph.D., distinguished professor of entomology at the University of California, Davis, says the data generated this way could be an untapped resource for agricultural entomologists.
As reported in a short communication published in June in the Journal of Economic Entomology, Rosenheim tested one such dataset for repeatability. This measure describes how closely estimates of insect density made by different people match. When repeatability is high, the data is more reliable.
Rosenheim reports that, while repeatability was a little lower than for similar data generated by researchers, the massive size of commercial datasets makes up for the small loss in precision.
Harnessing Big Data
“There’s a limit to how much data we can collect with our own hands as a researcher,” says Rosenheim. “And, when you have those limitations on how much data you can collect, that imposes real limits on what kinds of questions you can answer.”
This is a particular problem for questions that require researchers to resolve very small effect sizes. For example, some farmers operate on razor-thin profit margins, so a pest that decreases yield by just 1 percent could represent a quarter of that farmer’s annual profits.
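The arithmetic behind that claim is simple, and a minimal sketch with made-up numbers (the per-acre revenue and cost figures below are hypothetical, not from the study) shows how a tiny yield loss can swallow a large share of profit:

```python
# Hypothetical per-acre numbers to illustrate the thin-margin arithmetic.
revenue_per_acre = 1000.0  # gross revenue ($/acre)
costs_per_acre = 960.0     # production costs ($/acre) -> a 4% profit margin
profit = revenue_per_acre - costs_per_acre  # $40/acre

yield_loss = 0.01                            # a 1% pest-driven yield loss
lost_revenue = revenue_per_acre * yield_loss  # $10/acre of lost revenue

# That $10 comes straight out of the $40 profit margin.
share_of_profit = lost_revenue / profit
print(f"A 1% yield loss erases {share_of_profit:.0%} of profit")  # 25%
```

The thinner the margin, the larger the share: at a 2 percent margin, the same 1 percent yield loss would cost half of the farmer's profit.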

Scouts sampling farm fields generate data that could be an untapped resource for agricultural entomologists. (Photo by Kathy Keatley Garvey)
But designing an experiment that granular would mean a massive number of replicates, which simply isn’t feasible using traditional methods. That’s where large aggregate datasets could make a difference.
“There’s a huge army of people all across the country and across the world who are monitoring and scouting and counting to help farmers make decisions in real time,” says Rosenheim. “If we can marshal those data and pull them together from many, many different sources [and] create one centralized database that puts all of that data into one place where it can sort of communicate with itself and be integrated, then we can work with a number of observations that’s maybe a hundred times or maybe a thousand times bigger than what we could generate, if we were gathering the data with our own hands.”
Of course, that kind of data-sharing can bring up privacy issues for farmers—especially in sectors with thin profit margins or where trade secrets are closely guarded—and it can be time-consuming for those sharing the data with researchers.
“It really helps if you know the people,” Rosenheim says. “And you can also show them that, although it’s a chore to gather the data for you, the results that you obtain from analyzing those data are actually useful to them. You have to really attend to the interests of the people who are putting themselves out to share data.”
The big question is whether data collected by non-scientists can meet the standard for robust research.
Can You Repeat That?
To answer that question, Rosenheim looked at the repeatability of commercial data from cotton fields at three ranches in the San Joaquin Valley. There, scouts monitored western tarnished plant bugs (Lygus hesperus), a particularly pesky pest when it comes to sampling. That’s because the insect’s distribution is patchy, and it causes damage even at low densities, so scouts must estimate its numbers accurately even when bugs are scarce. Plus, the nymphs are tiny.
This data was perfect for analyzing repeatability because some of those fields were sampled twice in the same day by different scouts. Rosenheim also looked at a smaller, researcher-collected dataset he generated in 2004 for comparison.
Overall, repeatability for the commercial dataset was moderately high, with 63.1 percent of the variation explained by differences in insect density between fields. Accounting for observer effects—like differences in how individual samplers swing the net—brought that number up to 69.7 percent.
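Repeatability here is an intraclass correlation: the fraction of total variance in counts that reflects real differences among fields rather than sampling noise. A minimal sketch of that calculation, using simulated paired scout counts (all numbers below are invented for illustration, not the study's data), estimates it from one-way ANOVA variance components:

```python
import random
import statistics

random.seed(42)

# Simulate 50 fields, each sampled twice by different scouts.
# True Lygus density varies between fields; each count adds
# independent within-field sampling noise.
n_fields, k = 50, 2
true_density = [random.gauss(10, 3) for _ in range(n_fields)]
counts = [[d + random.gauss(0, 2) for _ in range(k)] for d in true_density]

# One-way ANOVA variance components, treating fields as groups.
grand_mean = statistics.mean(c for field in counts for c in field)
ms_between = k * sum((statistics.mean(f) - grand_mean) ** 2
                     for f in counts) / (n_fields - 1)
ms_within = sum((c - statistics.mean(f)) ** 2
                for f in counts for c in f) / (n_fields * (k - 1))

# Repeatability (intraclass correlation): share of total variance
# explained by real between-field differences in insect density.
var_between = (ms_between - ms_within) / k
repeatability = var_between / (var_between + ms_within)
print(f"repeatability = {repeatability:.2f}")
```

With these assumed noise levels the estimate lands in the same general range as the study's 63.1 percent; shrinking the within-field noise term (for instance, by modeling consistent observer effects separately, as Rosenheim did) pushes the estimate higher.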
Repeatability was higher for the researcher-generated dataset, but Rosenheim says the sheer size of commercial datasets can outweigh the noise, and working from a farmer’s own data can make collaboration with that farmer easier.
And Rosenheim is quick to point out that independent consultants and farmers are also experts. “If this method is applied to lots of different data sets, some data will be higher quality, some data will be lower quality,” he says. “I’m sure it will vary. But I won’t be surprised if many of the commercially generated datasets are actually very good, because the guys who do it are really good at what they do.”
Melissa Mayer is a freelance science writer based in Portland, Oregon. Email: melissa.j.mayer@gmail.com.