14 M4A: Find a Dataset Relevant To You
14.1 Introduction
Connection activities play a big part in this class. Understanding and application of data science are both great, but neither is very useful if you can’t bring them into a specific situation and make them work for you. With that in mind, most future connection activities (and all of your projects) are going to expect that you have access to a data set that you feel a personal or professional connection to. Ideally, you’d find a data set that is connected to your (present, past, or hoped-for future) professional context, whether that’s retrieved from your workplace (with permission, of course) or at least contextually relevant. However, I’ve also enjoyed seeing 661 students work with personally meaningful datasets related to Pokémon, Spotify, sports, or other interests. While I’m going to encourage you to pick something professionally relevant, the most important thing is that you work with something interesting.
This activity is all about finding one such relevant data set. If you find a solid data set today (one that meets the explicit and implicit requirements of projects and activities), you may be able to stick with it throughout the rest of the semester. However, if you get bored with a data set—or realize that your go-to data set won’t work for a particular activity—you can always find something else.
14.2 Signs of a Good Dataset
Before we get into the details, here are some important characteristics you’ll want to look for in data sets:
- at least 100 observations (rows) and 5 variables (columns)
- depending on the activity, you’ll probably want at least 5 quantitative variables (i.e., numbers); other variables are great, but they aren’t compatible with a lot of the statistical techniques we’ll be talking about
- although we challenge this idea in our class, more data is often better from a statistical point of view; if you can find a dataset with at least 1,000 observations and 10 variables, that will probably make things easier for you
- stored in a spreadsheet file format
- personally, I prefer
.csv
; I find it easier to work with, and it’s standard in data science circles .xsl
and.xslx
Excel spreadsheets are also common and supported by data import functions; however, I’ll be focusing on.csv
when providing assistance, so you will either have to do some extra research to adapt my code, or you’ll want to convert your spreadsheet to.csv
- personally, I prefer
- “tidy” data
- if you can find datasets that advertise themselves as tidy (1 observation per row, 1 variable per column), that will probably save you some hassle
- that said, we won’t cover tidy data until later in the semester, and it’s not always easy to tell whether another party’s dataset is tidy, so don’t stress too much about this
14.3 Finding a Good Dataset
Okay, now for the finding! There are plenty of sources for interesting data on the internet. The Kaggle collection of datasets is full of freely-available data ranging from the serious to the silly. It’s a great place to look if you don’t know where else to go. Governments of all kinds and sizes also collect data: You might search for federal, state, or local collections of “open data” if you think there might be something interesting for you there. A workplace is a good place to check for data so long as you can be sure that you have permission to do so. Finally, there are a lot of datasets associated with R packages. You can do some searching for one of those right now or pay close attention to the ones that we meet over the next few weeks; you might find one that you want to return to!
Once you’ve found the dataset you want to work with (at least for now), give it a short-but-descriptive name. If it’s a file format other than a .csv
, you should either convert it into a .csv
or be ready to look up how to load a file in another format into R, since I won’t be providing those directions. Once the file is ready to be used, move it to the activity_data
folder inside the project folder that you previously created for this class. Then, open up GitHub Desktop to commit this change to your project and push the changes so that they sync with the GitHub website.
I mentioned above that you’re free to stick with this data throughout the rest of the semester or that you can switch to new datasets whenever you feel like it. If you ever switch to a new dataset, make sure that you follow the steps above (giving it a short-but-descriptive name, ensuring it’s a .csv
or that you know how to load other data types, placing it in your activity_data
folder, and committing and pushing your changes in GitHub Desktop) every time.
14.4 Loading Your Dataset
I’ve already emphasized in a couple of walkthroughs up to this point that it’s important to be attentive to details when working in R. This is especially true when it comes to loading data into R. Up to this point, we have only worked with data that is preloaded in R packages; so long as you’ve properly loaded the package in question, you’ve loaded the data as well. A lot of our walkthroughs will only ever ask you to work with this data. However, later in the course, you’ll be adapting code from the walkthroughs to work with your own data; more importantly, you’ll be working with your own data when completing course projects.
When you’re working with your own data, you need to make sure it gets loaded into RStudio before your code can work with it. Let’s practice doing that now. We’re going to learn to do this through code. There is also a way to do this through the RStudio interface instead of typing out the code, but I’m not going to cover it here. My reasons for doing so are multiple. To begin with, I always load my data through code, so I honestly don’t know how to do it the other way off the top of my head.
Of course, I could look it up, but I also have other strong reasons for not covering this. Loading your data through the interface may be more straightforward when you’re completing walkthroughs and one-off activities, but:
- it is not reproducible, and it makes it harder for me to troubleshoot your code if you need any help, and
- it does not play nice with the
.Rmd
documents that you will use to submit your class projects, so you will have to load through code for the projects
To practice loading data, start by creating a new script in RStudio. Last week, we emphasized that saving your code in scripts is really helpful for referencing it later. You might want to give this script a name like data_loading.R
or something like that, to make it easy to reference in the future in case you need a refresher on how to load your own data. Please note that with future activities, I’m going to assume that you’re creating a new script for each activity and that you’re giving it a descriptive name—you won’t get advice like this in all walkthroughs! However, we’re still in the early stages of working with R, so a reminder here might be helpful!
We’re going to use two packages as part of our loading process: tidyverse
and here
. Chances are that you don’t have these packages installed yet, so you can do that now by running install.packages("tidyverse")
and install.packages("here")
. I recommend running this code in the Console. I’ve put a lot of emphasis on saving code in a script, but there’s always code that is so unimportant that it doesn’t need to be saved. This includes installing packages, which is almost always a one-and-done thing that doesn’t need to be repeated (at least, not for a while). Besides, if you include installation code in the .Rmd
files for your class projects, it creates all sorts of problems.
Now, let’s go ahead and load those packages:
We’re going to cover the tidyverse
package in more detail in a couple of weeks, so for now, just trust me that it’s important and necessary. The here
package, on the other hand, is worth getting into now. For any program on a computer to work with a file, it has to know where that file is so that it can access it. Modern programs on modern computers make this process so easy for us that we often don’t have to pay attention to the details of where a file is. R, on the other hand, is not so generous; it usually needs specific instructions for where to find a file before it can load it.
The point of the here
package is to make giving those specific instructions a bit easier for us. Let’s see what I mean by running the function here()
now that we have the package of the same name loaded:
[1] "/Users/spencergreenhalgh/spg_website/static/ICT_LIS_661_textbook_2023_Fall"
This is another case where the output that you see in the book is different than the output you’ll see where you run this code—and that’s the point of here()
! What this function does is to figure out the filepath for the folder that contains the R project you’re working in. So, the output that I get when I run here()
is the location of the R project associated with this textbook; it’s also formatted the way that macOS (the operating system I’m currently using) formats locations. You ought to get output that tells you the location of the R project that you set up for this class and that formats it in the way that your operating system formats locations. Think of here()
as a shortcut that takes you to the main folder that you’re working in; that way, you only have to provide the extra detail beyond that. (At the risk of repeating myself, though, this emphasizes the importance of finding and working in the expected main folder. If you’re not doing that, here()
won’t be as helpful.)
For example, if you’ve followed the directions in the previous section, you should have your dataset for this activity saved in the activity_data
folder. I also have a dataset—called Twitter_hashtags.csv
—saved in my activity_data
folder. If I run the following code, it tells R that I I’m looking inside my project folder (what the here()
function figures out on its own) to find a subfolder called activity_data
inside that project folder, to find a file called Twitter_hashtags.csv
inside that subfolder. The result is the exact location of the file on my computer, providing a lot more information that I didn’t have to type out on my own!
[1] "/Users/spencergreenhalgh/spg_website/static/ICT_LIS_661_textbook_2023_Fall/activity_data/Twitter_hashtags.csv"
When you run this code, it’s unlikely that your dataset is named Twitter_hashtags.csv
, so you should replace that with the name of your file. Likewise, as we saw above, your output should also look different because the location of your project folder is different than the location of mine! No matter how different your output looks, though, you ought to notice how much shorter your here()
argument is than the exact location that you might otherwise have to type out. That’s why here()
is useful!
In fact, let’s get to the good stuff. Now that I have the exact location of my dataset on my computer, I can put that inside the read_csv()
function, which (as the name suggests), loads .csv
files into RStudio. With the following code, I’m going to read my data in and save it as an object called df
(short for dataframe—it’s a lazy-but-handy default name for the object containing your data).
Rows: 60 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): hashtags
dbl (21): avg_hashtags_per_tweet, avg_mentions_per_tweet, proportion_origina...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
You should run this code yourself. You can give your object a more descriptive name than df
if you want, but you absolutely have to change Twitter_hashtags.csv
to the name of your file if you want this to work! Once again, you should expect your output to look different than mine; my output is a quick summary of my data; your output will be a quick summary of your data.
The explanation-to-code ratio of this walkthrough is pretty high—higher than is ideal. The important part is that you remember this last line of code that you ran; so long as you save your data to the activity_data
subfolder in your project folder and replace Twitter_hashtags.csv
with the name of your dataset, this code will be a reliable way for you to load data into R. Knowing how to do this will be very useful for you for class projects and other activities. The rest of the explanation is just to emphasize how important this line of code is!