Research Cyberinfrastructure team helps researcher press play on YouTube study

With 2 billion active users every month watching videos on YouTube, the video giant is an undeniably dominant worldwide platform for spreading information.

But what if that information is false or misleading? And what do we really know about who is actually producing the videos we watch?

These are key questions behind the research of Kaiping Chen, an assistant professor of computational communication in the University of Wisconsin–Madison’s Department of Life Sciences Communication and an affiliate of the UW–Madison Robert & Jean Holtz Center for Science and Technology Studies.

While Chen has studied political communication in the digital space, her current work also delves into how the average internet user consumes science content online—from climate change to COVID-19.

Take climate change: On one hand, conspiracy theories abound on YouTube, Chen explains, noting videos that allege government manipulation of “the weather weapon.” But on the other hand, plenty of videos on YouTube also feature ordinary citizens attempting to educate people about the science of climate change.

“And both types of videos receive a lot of views and engagement,” Chen says.

Kaiping Chen, an assistant professor of life sciences communication, speaks to her LSC 250 class. Teaching assistant Shiyu Yang is at background left. Photo (taken before COVID-19 restrictions) by Michael P. King/UW–Madison CALS

“But while there are many science videos on YouTube, scholarship is still very scarce when it comes to actually studying digital knowledge production on this large and influential platform,” adds Chen, explaining that her research and teaching (LSC 250 and LSC 375) aim to harness data science skills for solving social problems, providing a path for “constructive online citizens.”

But before Chen can contribute meaningful scholarship in this emerging field, she needs tools capable of handling the massive amounts and various types of data produced from analyzing thousands of YouTube videos.

That’s where the UW–Madison Division of Information Technology’s Research Cyberinfrastructure team entered the frame, helping Chen press play on her research quests.

Looking to ‘the cloud’

Familiar with the cloud computing platform Microsoft Azure from her previous work, Chen came to the Research Cyberinfrastructure team hoping to talk about options for massive amounts of data extraction from YouTube, as well as data analysis and storage.

In Chen’s case, consultation and collaboration with the Research Cyberinfrastructure team involved digging into YouTube’s application programming interface, or API, to examine and test search functionality that would best support her analysis of the science video footage she wanted to study.
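To give a sense of what testing search functionality can look like, here is a minimal sketch that builds a request URL for the public YouTube Data API v3 search endpoint. The article does not say which endpoints or parameters the team settled on, so the helper name `build_search_url`, the query string, and the placeholder key are assumptions for illustration only. The sketch builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Public search endpoint of the YouTube Data API v3.
SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(query, api_key, max_results=25, page_token=None):
    """Return a search request URL for videos matching `query`.

    Hypothetical helper: it only assembles the URL and sends nothing.
    """
    params = {
        "part": "snippet",          # return basic metadata for each result
        "q": query,                 # free-text search terms
        "type": "video",            # exclude channels and playlists
        "maxResults": max_results,  # the API accepts values from 0 to 50
        "key": api_key,             # placeholder; a real key is required
    }
    if page_token:
        params["pageToken"] = page_token  # continue from a previous page
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Example query; "YOUR_API_KEY" is a placeholder, not a working credential.
url = build_search_url("climate change", api_key="YOUR_API_KEY")
print(url)
```

Because results arrive one page at a time, a collection script would typically call a helper like this repeatedly, passing along the page token returned with each response.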

YouTube is a big data challenge from a volume standpoint: there’s too much data to download all of it. And once downloaded, working with the data is complicated because each video has many components and data formats (e.g., the video itself, user comments, closed captioning or transcripts, and statistics on views, likes and shares).
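One way to picture those components is as a single record per video that bundles the different formats together. The sketch below is an assumption about how such a record might be organized, not a description of the structure Chen's team used; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VideoStats:
    """Engagement counts for one video."""
    views: int = 0
    likes: int = 0
    shares: int = 0


@dataclass
class VideoRecord:
    """Everything collected about one video (hypothetical layout)."""
    video_id: str
    media_path: str = ""   # where the downloaded video file is stored
    transcript: str = ""   # closed captions or transcript text
    comments: list = field(default_factory=list)           # user comments
    stats: VideoStats = field(default_factory=VideoStats)  # views, likes, shares

    def comment_count(self):
        return len(self.comments)


# Made-up values, for illustration only.
record = VideoRecord(
    video_id="example-id",
    transcript="Hello and welcome.",
    comments=["Great video", "Thanks"],
    stats=VideoStats(views=100, likes=7),
)
print(record.comment_count(), record.stats.views)
```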

Compared to analyzing structured data in spreadsheets with statistical analysis software, analyzing big, unstructured data like YouTube’s is much more like software development—it involves developing and managing numerous code elements that perform functions like formatting, organizing, and selecting data, and stitching together even more software components of the data analysis and visualization workflows that come later.

“It was a lot of trial and error, but we found the optimal way of collecting the data my team and I needed,” Chen said. “It was very helpful, and I got a lot of support.”

Chen added that she particularly appreciated the Research Cyberinfrastructure team’s thorough attempts to understand what she was trying to accomplish before offering help.

The planning phase is key and starts with mapping out end-to-end how you will collect and store data, what you need to do to extract, transform, or load it to make it suitable for analysis, and then what steps are necessary to process it and visualize the results. With a high-level plan in place, the next step is to develop a solution that works for one data point and then build a repeatable pipeline that can be automated and scaled up to process large volumes of data.
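The "one data point first, then a repeatable pipeline" idea can be sketched in a few lines. This is an illustrative outline under simple assumptions: the input is a list of JSON strings, the store is an in-memory list standing in for real storage, and the function and field names are hypothetical rather than taken from the team's code.

```python
import json


def extract(raw):
    """Parse one raw JSON string into a dictionary."""
    return json.loads(raw)


def transform(item):
    """Keep only the fields needed for analysis and normalize them."""
    return {
        "id": item["id"],
        "title": item.get("title", "").strip().lower(),
        "views": int(item.get("views", 0)),
    }


def load(row, store):
    """Append the cleaned row to an in-memory store (stand-in for a database)."""
    store.append(row)


def process_one(raw, store):
    """The solution for a single data point: extract, transform, load."""
    load(transform(extract(raw)), store)


def run_pipeline(raw_items):
    """Repeat the single-item solution over every item."""
    store = []
    for raw in raw_items:
        process_one(raw, store)
    return store


# Made-up sample input, for illustration only.
sample = [
    '{"id": "a1", "title": "  Climate Basics ", "views": "120"}',
    '{"id": "b2", "title": "Weather Myths"}',
]
rows = run_pipeline(sample)
print(rows)
```

Once `process_one` works for a single item, scaling up is mostly a matter of feeding `run_pipeline` more input and swapping the in-memory list for real storage.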

“Many times when you ask somebody for help, they’ll say, ‘Oh, I’ll send you this link,’ and then you explore yourself,” Chen said. “But what Mike and his team did was first invite me and my students to their office to walk us through a demo, step by step. It wasn’t just a huge information load—they showed me how to use it, and then even showed me some example ideas that were very related to my interests.”

Changing the game for research

In more traditional models of academic research, setting up a lab used to involve purchasing servers and equipment up front in order to run computing software “on-premise,” meaning that the necessary software gets installed locally, on the lab’s own computers and servers. But as academic research increasingly leverages emerging third-party public cloud options, those start-up capital expenses may become less necessary.

That’s because the public cloud model is centered around on-demand computing resources: everything from operating systems to applications to servers to massive amounts of data storage is available online on a pay-for-use basis. Software and data are then hosted by vendors on remote computers “in the cloud,” with researchers connecting via their browser or program interface on their own computers.

Using public cloud options, researchers may spend less on equipment up front. But researchers’ cost concerns then shift to maximizing the work they can do within the cloud’s pay-for-use system as they progress in their investigations.

“Essentially, you pay for what you’re using at the time when you’re using it,” explains Mike Layde, research data storage lead with the Research Cyberinfrastructure team. “So there’s a big emphasis on optimizing everything you do.”

Chen and her students optimized their data collection by developing a process that could be run in multiple environments: on laptops, campus servers, or cloud computing resources. This approach allows students to complete many of the smaller computational steps on a laptop computer, and then use more powerful graphics processing units (GPUs) in Microsoft Azure. This way, they can avoid purchasing expensive GPU cards for their in-house computers, and also spend their dollars in Azure on the most computing-intensive parts of their workflow.
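A simplified way to express that split is to tag each step of a workflow with whether it needs a GPU and route it accordingly. The sketch below is an assumption about how such a plan could be written down. The step names and the "local" and "cloud-gpu" labels are hypothetical, and nothing here calls Azure itself.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One stage of a workflow and whether it needs a GPU."""
    name: str
    needs_gpu: bool


def choose_environment(step, gpu_available_locally=False):
    """Return where a step should run: locally or on a cloud GPU machine."""
    if step.needs_gpu and not gpu_available_locally:
        return "cloud-gpu"
    return "local"


def plan(steps, gpu_available_locally=False):
    """Map every step name to the environment it should run in."""
    return {s.name: choose_environment(s, gpu_available_locally) for s in steps}


# Hypothetical workflow steps, for illustration only.
steps = [
    Step("collect_metadata", needs_gpu=False),
    Step("clean_comments", needs_gpu=False),
    Step("analyze_video_frames", needs_gpu=True),
]
print(plan(steps))
```

Keeping the routing decision in one place means pay-for-use resources are requested only for the steps that need them.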

‘Try before you buy’

With cloud computing’s myriad tools, approaches and capabilities, researchers can build toward their end game, rather than having to jump in all the way, right away. And that’s starting to change how academic researchers seek their funding, explains Chris Lalande, research cloud technician with the Research Cyberinfrastructure team.

“One of the really neat things about a cloud use case is that when a researcher has a hypothesis or an idea, they can try it out on a really small scale, without making a major investment,” Lalande says.

“They can kind of ‘try before they buy,’” added Lalande, whose new role involves consulting with researchers across campus to find the best fit for their data analysis and storage needs. Using this try-it-first approach, researchers can start projects with just a small amount of expenditure.

“If they’re seeing their sentiment analysis is giving really interesting results, then they have a good story to tell to go and get a bigger grant,” Lalande explains. “They can start small, test their hypothesis—or at least show that it’s promising—before they have to invest a lot of money. It makes it a lot easier for the researchers.”

As a research cloud technician, Lalande helps researchers navigate the cloud space and understand what services they may be able to obtain for lower costs, or even free, depending on when and how they run their analysis.

“Part of my role is to help guide them on how to avoid accidentally spending a bunch of money, which is sometimes the fear of using the cloud,” Lalande says.

New features for things like database utilities, machine learning platforms, and virtualized computing resources are popping up in the offerings of public cloud vendors on a regular basis. Many of these could be of potential interest to UW–Madison researchers. However, “researchers don’t have the time to stay up on what all is available to them in the cloud—nor is it a good use of their expertise,” Lalande explained.

“I can guide them toward the tools that make the most sense for their research and also the most economical ways to use them.”

Get help

To learn more about the UW–Madison Research Cyberinfrastructure initiative and services, or to request a consultation, please visit: https://researchci.it.wisc.edu/.