
    Filter

    The filter verb

    Now that you've seen some code in action on the music data, we'll focus on writing your own code.

    In this chapter, you'll learn about the "verbs" in the siuba package. The first verb you'll use is filter.

    You use filter when you want to keep only a subset of your observations, based on a particular condition.

    Filtering data is a common first step in an analysis.

    Filter for top songs

    (music_top200
      >> filter(_.position == 1)
    )

    Every time you apply a verb, you'll use a pipe. A pipe is a block of commands, surrounded by parentheses.

    For example, let's say we want to keep only songs in the first position on the music charts. This is done by using the pipe shown on the slide.

    Click the down arrow on the slideshow to see it broken into 3 steps:

    1. start the block
    2. write the pipe operator and verb name
    3. write the operation

    Filter step 1: start the block

    (music_top200
    
    )

    A block is written with opening and closing parentheses, and the name of your data in the middle.

    Press enter twice after the name of your data, to make an empty line.

    Filter step 2: pipe operator and verb name

    (music_top200
      >> filter()
    )

    Next, write a pipe operator using two greater than signs. It says "take whatever is before the pipe operator, and feed it into the next step."

    In this case, the next step will be filter.
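    To make the "feed it into the next step" idea concrete, here is a minimal sketch in plain Python. The `Verb` class and the `keep_first` function are hypothetical names made up for this illustration. This is not how siuba is implemented, only one way a `>>` operator can hand data to the next step.

```python
# Hypothetical stand-in for a verb; NOT siuba's real implementation.
class Verb:
    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python calls this for `data >> verb` when `data` itself
        # does not know how to handle the >> operator.
        return self.func(data)


def keep_first(n):
    """Hypothetical verb: keep the first n items of a list."""
    return Verb(lambda data: data[:n])


songs = ["Tusa", "Blinding Lights", "The Box"]

# Same shape as the blocks on the slides: data first, then >> and a verb.
result = (songs
  >> keep_first(2)
)
print(result)  # ['Tusa', 'Blinding Lights']
```

    The parentheses around the block are what let the pipe continue onto a new line.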

    Filter step 3: write the operation

    (music_top200
      >> filter(_.position == 1)
    )

    Finally, we can complete our first verb.

    We have all 200 hit songs on the charts, but just want to get the first.

    The "position equals equals 1" is the condition we are using to filter observations. The "equals equals" may be surprising: it's what we call a "logical equals", an operation that compares two values: each position, and the number 1.

    A single equals here would mean something different in Python, which you'll see later.
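    As a small illustration outside of filter, the sketch below shows how the two differ in ordinary Python code. The variable names are made up for this example, and it does not cover what a single equals means inside a verb.

```python
# Single equals: assignment. Stores the value 1 under the name `position`.
position = 1

# Double equals: comparison. Asks a question and produces True or False.
is_top = position == 1
is_second = position == 2

print(is_top)     # True
print(is_second)  # False
```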

    Let's see what this code outputs.

    Filter for top songs

    (music_top200
      >> filter(_.position == 1)
    )

    country position track_name artist streams duration continent
    0 Argentina 1 Tusa KAROL G 1858666 200.960 Americas
    200 Austria 1 Blinding Lights The Weeknd 229576 201.573 Europe
    400 Australia 1 Blinding Lights The Weeknd 1757343 201.573 Oceania
    ... ... ... ... ... ... ... ...
    11800 Uruguay 1 Tusa KAROL G 120175 200.960 Americas
    12000 Viet Nam 1 Sweet Night V 189261 214.259 Asia
    12200 South Africa 1 The Box Roddy Ricch 94422 196.653 Africa

    62 rows × 7 columns

    Notice that now, we have only 62 rows: that's how many countries are in the dataset.

    It's important to note that you're not removing any rows from the original music data. You can still use the music_top200 object for other analyses, and it won't be any different than it was before.

    Instead, filter is returning a new dataset, one with fewer rows, that then gets printed to the screen.
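    The same "new result, untouched original" behavior can be sketched in plain Python. This is an analogy using a list of rows rather than siuba itself; `music_sample` is a hypothetical four-row excerpt built from rows shown in this lesson.

```python
# Hypothetical excerpt of the music data, as a plain list of rows.
music_sample = [
    {"country": "Argentina", "position": 1, "track_name": "Tusa"},
    {"country": "United States", "position": 1, "track_name": "The Box"},
    {"country": "United States", "position": 2, "track_name": "Myron"},
    {"country": "United States", "position": 3, "track_name": "Blueberry Faygo"},
]

# Build a NEW list holding only the rows that meet the condition.
top_songs = [row for row in music_sample if row["position"] == 1]

print(len(top_songs))     # 2
print(len(music_sample))  # 4, the original still has every row
```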

    Filter for country

    (music_top200
      >> filter(_.country == "United States")
    )

    country position track_name artist streams duration continent
    7800 United States 1 The Box Roddy Ricch 12987027 196.653 Americas
    7801 United States 2 Myron Lil Uzi Vert 9163134 224.955 Americas
    7802 United States 3 Blueberry Faygo Lil Mosey 8043475 162.547 Americas
    ... ... ... ... ... ... ... ...
    7997 United States 198 Lights Up Harry Styles 1606234 172.227 Americas
    7998 United States 199 Without Me Halsey 1606153 201.661 Americas
    7999 United States 200 Enemies (feat. DaBaby) Post Malone 1597824 196.760 Americas

    200 rows × 7 columns

    You could choose another condition to filter on, besides the position. For example, suppose we wanted to get only the observations from the United States.

    We would write this as "filter country equals equals quote United States endquote", resulting in only the 200 observations from that country.

    The quotes around United States are important: otherwise Python won't understand that the words "United" and "States" are the content of a text variable, as opposed to variable names. You didn't need quotes around a number like 1, but you do around text.
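    Here is a minimal sketch of what goes wrong without quotes, using a one-word country so the error is easy to read. The variable names are only for illustration.

```python
# With quotes: the text Austria.
country = "Austria"

# Without quotes: Python looks for a variable named Austria and fails.
try:
    country = Austria
except NameError as err:
    message = str(err)

print(message)  # name 'Austria' is not defined
print(country)  # Austria (the failed line changed nothing)
```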

    Filter with two variables

    (music_top200
      >> filter(_.position == 1, _.country == "United States")
    )

    country position track_name artist streams duration continent
    7800 United States 1 The Box Roddy Ricch 12987027 196.653 Americas

    We can specify multiple conditions in the filter.

    Each of the conditions is separated by a comma: here we are saying we want only the one observation for position 1, comma, where the country is the United States.

    Each of these equals equals expressions is called an argument. This kind of double filter is useful for extracting a single observation you're interested in.
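    One way to picture comma-separated conditions is that a row is kept only when every condition is true. The sketch below shows that idea in plain Python; `keep_rows` and `music_sample` are hypothetical names for this illustration, not part of siuba.

```python
def keep_rows(rows, *conditions):
    """Hypothetical helper: keep rows for which EVERY condition is true."""
    return [row for row in rows if all(cond(row) for cond in conditions)]


# Hypothetical excerpt built from rows shown in this lesson.
music_sample = [
    {"country": "Argentina", "position": 1, "track_name": "Tusa"},
    {"country": "United States", "position": 1, "track_name": "The Box"},
    {"country": "United States", "position": 2, "track_name": "Myron"},
]

# Two arguments, separated by a comma, just like the filter on the slide.
result = keep_rows(
    music_sample,
    lambda row: row["position"] == 1,
    lambda row: row["country"] == "United States",
)

print(len(result))              # 1
print(result[0]["track_name"])  # The Box
```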

    You'll be able to practice this in the exercises.

    Let's practice!

    Scroll down to investigate the music data with filter.

    Exercise 1:

    The code below is filtering to keep only hits where country is United States. Change the filter to get hits from Canada.

    country position track_name artist streams duration continent
    7800 United States 1 The Box Roddy Ricch 12987027 196.653 Americas
    7801 United States 2 Myron Lil Uzi Vert 9163134 224.955 Americas
    7802 United States 3 Blueberry Faygo Lil Mosey 8043475 162.547 Americas
    ... ... ... ... ... ... ... ...
    7997 United States 198 Lights Up Harry Styles 1606234 172.227 Americas
    7998 United States 199 Without Me Halsey 1606153 201.661 Americas
    7999 United States 200 Enemies (feat. DaBaby) Post Malone 1597824 196.760 Americas

    200 rows × 7 columns

    Test yourself

    Comparing results, which artist is in the top 3 in both the United States and Canada?

    Exercise 2:

    • Filter to keep tracks where position equals 5.

    ⚠️: Don't forget to replace all the blanks!

    Test yourself

    Which artist is in position 5 in South Africa?

    Exercise 3:

    Find the top 5 songs in Hong Kong.

    In the slides we discussed the == operator. Here is a longer list of some options!

    operator meaning
    == is equal to
    < is less than
    > is greater than
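    As a quick check of how these operators read, the sketch below compares a single number from the lesson's output. Each comparison produces True or False, and that value is what decides whether a row is kept. The variable names are chosen only for this example.

```python
# Streams for "Blinding Lights" in Austria, taken from the lesson's output.
streams = 229576

same = streams == 229576   # is equal to
fewer = streams < 1000000  # is less than
more = streams > 1000000   # is greater than

print(same)   # True
print(fewer)  # True
print(more)   # False
```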

    ⚠️: Don't forget to replace all the blanks!

    Exercise 4: looking at The Weeknd's streams

    How many times has The Weeknd had over 1,000,000 streams?

    Hint: Do this in steps. First, run your code to get all rows where the artist is The Weeknd, then modify it to keep only the rows where he has over 1,000,000 streams.