Cancer Researchers: Do You ‘Speak Data Science’? Test Your Knowledge

March 12, 2024, by NCI CBIIT Staff

So, you’re new to the cancer research lab. Maybe you’ve started learning more about data science to enhance your research, or perhaps you have a colleague with data science expertise and you want to improve your collaboration with them. Since data science is here to stay, learning the correct definitions for data science terms and understanding basic data science concepts will help you be more confident throughout your career.

We’ve put together this 10-question quiz for you to test your knowledge on key data science terms, so you can feel good about applying these concepts to your work and communicate better with your data science colleagues!

When you’re done, visit our “Training” section to find more comprehensive information on cancer data science, and explore lists of resources that you can use on your data science journey.

Let’s get started!

Question 1: What is data cleaning?

A. Submitting your raw data to a repository, so that the repository staff can fix any errors in your data for you.
B. The process of fixing or removing data that’s inaccurate, duplicated, or outside the scope of your research question.
C. The process of removing data that doesn’t fit with your hypothesis from your results.

Answer:

The correct answer is B.

For example, you may have made a mistake during data entry or have inconsistent formatting; data cleaning is the process of correcting those errors. Tip: make a copy of your raw data set before you begin cleaning your data. This way, you can go back to the original if you make a mistake during cleaning. For more information on this topic, visit our “Cleaning Data: The Basics” webpage.

Response A is incorrect because while it’s true that data repository staff do check submitted data for errors, inconsistencies, or missing information, you should not rely on repositories for your data cleaning.

Response C is incorrect because the fact that some of your data may not support your hypothesis doesn’t mean you should remove that data.
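
To make the idea concrete, here is a minimal sketch of data cleaning in Python, using only the standard library. The records and the field names (`patient_id`, `site`, `age`) are made up for illustration and are not from any real data set. The sketch follows the tip above by working on a copy of the raw data, then fixes inconsistent formatting and removes a duplicated entry.

```python
import copy

# Hypothetical raw records; field names and values are made up for illustration.
raw_records = [
    {"patient_id": "P001", "site": "Lung ", "age": "64"},
    {"patient_id": "P002", "site": "lung", "age": "58"},
    {"patient_id": "P001", "site": "Lung ", "age": "64"},  # duplicated entry
    {"patient_id": "P003", "site": "BREAST", "age": "71"},
]


def clean_records(records):
    """Return a cleaned copy of the records, leaving the originals untouched."""
    working = copy.deepcopy(records)  # work on a copy so the raw data stays intact
    cleaned = []
    seen = set()
    for row in working:
        # Fix inconsistent formatting: trim whitespace and use one letter case.
        row["site"] = row["site"].strip().lower()
        row["age"] = int(row["age"])
        # Remove duplicates: keep only the first copy of each identical row.
        key = (row["patient_id"], row["site"], row["age"])
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


cleaned_records = clean_records(raw_records)
print(len(raw_records), "raw records,", len(cleaned_records), "after cleaning")
```

Because the function works on a deep copy, `raw_records` still holds the original values afterward, so you can always start over if a cleaning step turns out to be wrong.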

Question 2: True or False

Facilities collecting data on new cancer cases need to report those cases to a central cancer registry.

Answer:

The correct answer is True. This is required by law. A central cancer registry, such as a state registry, will require you to meet specific requirements to capture important cancer data. This might include histology findings, primary tumor site, and more. To learn more about this process, explore our “Generating and Collecting Data: The Basics” webpage.

Question 3: What are the three Cs of working with data?

A. Categorized, Consistent, Clean
B. Complete, Consistent, Correct
C. Configured, Controlled, Correct

Answer:

The correct answer is B. When working with data, you must ensure it is complete (not missing data), consistent (the data you collected at the beginning of the study matches, in semantics and scope, the data from the end of the study), and correct (free of outliers and duplication). To learn more about working with data, visit our “Cleaning Data: The Basics” webpage.
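
As a rough illustration of checking data against the three Cs, the Python sketch below counts records that fall short on each one. Everything in it is an assumption made for the example: the records, the field names, and the `ALLOWED_STAGES` coding scheme are hypothetical. For simplicity, the "correct" check only looks for duplicated records and does not attempt outlier detection.

```python
from collections import Counter

# Hypothetical study records; None marks a missing value.
records = [
    {"patient_id": "P001", "stage": "II", "tumor_size_mm": 21.0},
    {"patient_id": "P002", "stage": "2", "tumor_size_mm": 18.5},    # stage coded differently
    {"patient_id": "P003", "stage": "III", "tumor_size_mm": None},  # missing measurement
    {"patient_id": "P001", "stage": "II", "tumor_size_mm": 21.0},   # duplicated record
]

ALLOWED_STAGES = {"I", "II", "III", "IV"}  # assumed coding scheme for this sketch


def check_three_cs(rows):
    """Count the rows that break each of the three Cs."""
    # Complete: no field should be missing.
    incomplete = sum(1 for r in rows if any(v is None for v in r.values()))
    # Consistent: every stage should follow the same coding scheme.
    inconsistent = sum(1 for r in rows if r["stage"] not in ALLOWED_STAGES)
    # Correct (duplicates only): each identical row should appear once.
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    duplicated = sum(n - 1 for n in counts.values())
    return {
        "incomplete": incomplete,
        "inconsistent": inconsistent,
        "duplicated": duplicated,
    }


report = check_three_cs(records)
print(report)
```

A report like this only tells you where to look; deciding whether to fix or remove a flagged record is still a judgment call for the researcher.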

Question 4: How does data exploration and analysis help you as you conduct your research and work with your data?

A. Data exploration and analysis helps you identify what you want to learn from the data, and then act towards understanding the meaning of the data.
B. Data exploration and analysis is when you search data repositories for data that you want to use for your research.
C. Data exploration and analysis is when you use statistical models or machine learning to test your hypothesis.

Answer:

The correct answer is A. When you explore your data, you’ll look for trends and patterns to help you form a hypothesis for further investigation. When you analyze your data, you’ll likely use statistical models or machine learning to test that hypothesis. Data repositories do hold data you may want to use in your own research, but that’s not what data scientists are referring to when they talk about data exploration and analysis. To learn more, visit our “Exploring and Analyzing Data: The Basics” webpage.

Question 5: How do you define consortium sharing?

A. When you share between yourself and another investigator, either upon publication or when the other investigator requests.
B. Sharing with the larger research communities, institutions, and public.
C. Sharing with large collaborative groups.

Answer:

The correct answer is C. Consortium sharing is sharing with a large collaborative group, and it only benefits a focused group. This is different from collaborator sharing, which only helps the individual, or broad sharing, which helps the community and ensures fair and equitable data access. Learn more about the topic on our “Sharing Data: The Basics” webpage so that you’re ready to comply with guidelines and advance the field of cancer research.

Question 6: True or False

Predictive models are like powerful calculators that help us better understand a patient.

Answer:

The correct answer is True. Predictive models can help you understand a patient by considering factors such as patient information, genetics, and treatment history. Find out what the two types of models are (and more) by visiting our “Predictive Modeling: The Basics” webpage.

Question 7: Which type of chart visualizes data through variation in coloring applied to a tabular format?

A. Network diagram
B. Pie chart
C. Scatterplot
D. Heat map

Answer:

The correct answer is D. You may want to use a heat map if you want to show values across multiple variables to reveal patterns. It’s a common chart for visualizing genomics data. Learn more about how charts are used in data visualization for cancer research on our “Data Visualization: The Basics” webpage.

A network diagram shows how things are interconnected by linking nodes of data with lines to represent their connections.

A pie chart breaks a circle into segments to illustrate proportions and percentages between categories.

A scatterplot places points on a Cartesian coordinate system to show the relationship between two sets of data.
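
To show what "coloring applied to a tabular format" means in practice, here is a small Python sketch that turns a table of numbers into a text-only heat map. The gene names, sample names, and values are invented for illustration, and the shade characters stand in for a real color scale. An actual analysis would more likely use a plotting library.

```python
# Hypothetical expression values: rows are genes, columns are samples.
genes = ["GENE_A", "GENE_B", "GENE_C"]
samples = ["S1", "S2", "S3", "S4"]
values = [
    [0.1, 0.4, 0.9, 1.0],
    [0.8, 0.7, 0.2, 0.0],
    [0.5, 0.5, 0.6, 0.3],
]

SHADES = " .:*#"  # lightest to darkest; stands in for a color scale


def shade(value, low, high):
    """Map a value onto one of the shade characters."""
    if high == low:
        return SHADES[0]
    fraction = (value - low) / (high - low)
    index = min(int(fraction * len(SHADES)), len(SHADES) - 1)
    return SHADES[index]


# Scale every cell against the smallest and largest value in the whole table.
low = min(min(row) for row in values)
high = max(max(row) for row in values)
heat_map = [[shade(v, low, high) for v in row] for row in values]

# Print the table with one shaded cell per value.
print(" " * 8 + " ".join(samples))
for gene, row in zip(genes, heat_map):
    print(f"{gene:<8}" + "  ".join(row))
```

The table layout is unchanged; only the way each cell is drawn varies with its value, which is what lets patterns across rows and columns stand out.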

Question 8: How do you define secondary data sets?

A. Data sets generated by reusing primary data sets.
B. Data sets generated by running your experiment a second time.
C. Data sets that are a back-up to your primary data sets. The quality of these data may not be as good.

Answer:

The correct answer is A. Secondary data sets are data sets generated by the re-use of primary data sets. Data gathered by running an experiment again, or collecting new data from study participants, is not the same as secondary data sets. Having a lower quality, back-up set of data is not a standard practice in cancer research data science. Head to our “Sharing Data: The Basics” webpage to learn more about data sharing terms and concepts.

Question 9: Which chart is commonly used when presenting timelines in a grant proposal or funding request?

A. Histogram diagram
B. Network diagram
C. Gantt chart
D. Scatterplot

Answer:

The correct answer is C. A Gantt chart can display a list of activities or tasks with their duration over time for organizational purposes. Explore visualization charts on our “Visualizing Data: The Basics” webpage.

A histogram may be useful to compare age range data.

A network diagram can help you analyze relationships between cancer occurrences in various communities.

You might use a scatterplot to visualize dose response curves.

Question 10: You want to clean your data, but there’s a lot of it, and it’s an overwhelming task. You ask your data scientist colleague for advice, and they tell you to use Python. What do they mean?

A. Python is a tool that functions like a search engine, helping you find advice on how to resolve your situation.
B. Python is a programming language that can alleviate your workload and help you with the decision making process.
C. Your colleague is making a joke, suggesting a snake can help you.

Answer:

The correct answer is B. Python is a programming language used by data scientists. It can help you clean data. For example, you can use Python to remove unnecessary columns, filter results, and validate data sets. To learn more, visit our “Cleaning Data: The Basics” webpage.
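
The three tasks mentioned above (removing unnecessary columns, filtering results, and validating data) might look something like the following in Python. This is a sketch under stated assumptions: the records, the column names, the adults-only filter, and the 0 to 120 age range are all hypothetical choices made for the example, and it uses only the standard library.

```python
# Hypothetical records; the column names are made up for illustration.
records = [
    {"patient_id": "P001", "age": 64, "diagnosis_year": 2019, "notes": "n/a"},
    {"patient_id": "P002", "age": 17, "diagnosis_year": 2021, "notes": ""},
    {"patient_id": "P003", "age": 230, "diagnosis_year": 2020, "notes": "check"},
    {"patient_id": "P004", "age": 52, "diagnosis_year": 2022, "notes": ""},
]

KEEP_COLUMNS = ("patient_id", "age", "diagnosis_year")  # drops the "notes" column


def drop_columns(rows, keep):
    """Remove unnecessary columns by keeping only the named ones."""
    return [{name: row[name] for name in keep} for row in rows]


def filter_adults(rows):
    """Filter results: keep records in scope (here, ages 18 and over)."""
    return [row for row in rows if row["age"] >= 18]


def validate(rows):
    """Validate data: separate plausible ages from ones that need review."""
    valid, flagged = [], []
    for row in rows:
        if 0 <= row["age"] <= 120:
            valid.append(row)
        else:
            flagged.append(row)
    return valid, flagged


trimmed = drop_columns(records, KEEP_COLUMNS)
adults = filter_adults(trimmed)
valid, flagged = validate(adults)
print(len(valid), "valid records;", len(flagged), "flagged for review")
```

Flagged records are set aside rather than deleted, so someone can check them against the source before deciding what to do.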

How did you do? If you got them all correct, congratulations! If you missed some, that’s okay too; your interest in learning more about cancer data science is the important part.

Expand your data science knowledge in our suite of cancer data science how-to guides, video courses, and resources available for you in our Training section. Discover the difference it makes in your career.
