• overview

  • Course welcome
  • wrangle

  • Filter
  • Arrange
  • Mutate
    • Wrap up
    • visualize

    • Getting started
    • Geoms
    • Aesthetics
    • Facets
    • summarize

    • Summarize
    • Group by
    • Visualizing summaries
    • plot types

    • Line plots
    • Bar plots
    • Histograms
    • Boxplots

    Mutate

    The mutate verb

    You've already learned two verbs for manipulating data: filter for extracting a subset of the observations, and arrange for sorting them.

    Suppose you want to change one of the variables in your dataset, based on the other ones. Or suppose you want to add a new variable.

    For that, you would use the mutate verb.

    Using mutate to change a variable

    (music_top200
      >> mutate(streams = _.streams / 1000)
    )

    First you'll learn how to change an existing variable.

    You use mutate like you would filter or arrange, after a pipe operator.

    Inside this mutate statement, what's on the right of the equals sign is what's being calculated, and what's on the left is what's being replaced. Here, you're calculating streams slash one thousand.

    (Click the down arrow for a breakdown)

    Using mutate to change a variable

    (music_top200
      >> mutate(streams = _.streams / 1000)
    )
    

    The slash in Python performs division, so "streams divided by one thousand".

    Using mutate to change a variable

    (music_top200
      >> mutate(streams = _.streams / 1000)
    )
    

    On the left, you're saying you want to replace the existing streams column, by writing streams equals.

    Using mutate to change a variable (result)

    (music_top200
      >> mutate(streams = _.streams / 1000)
    )

    country position track_name artist streams duration continent
    0 Argentina 1 Tusa KAROL G 1858.666 200.960 Americas
    1 Argentina 2 Tattoo Rauw Alejandro 1344.382 202.887 Americas
    2 Argentina 3 Hola - Remix Dalex 1330.011 249.520 Americas
    ... ... ... ... ... ... ... ...
    12397 South Africa 198 Black And White Niall Horan 11.771 193.090 Africa
    12398 South Africa 199 When I See U Fantasia 11.752 217.347 Africa
    12399 South Africa 200 Psycho! MASN 11.743 197.217 Africa

    12400 rows × 7 columns

    The result is the same table, but with streams replaced with a new value, one that's much smaller than it was before.

    This is how you manipulate existing variables in the table. That's often necessary during data processing and cleaning.

    Just like filter and arrange, you're not altering the original gapminder data, you're just changing the value in this new data frame that's being returned.

    Using mutate to add a new variable

    (music_top200
      >> mutate(ttl_stream_time = _.streams * _.duration)
    )

    country position track_name artist streams duration continent ttl_stream_time
    0 Argentina 1 Tusa KAROL G 1858666 200.960 Americas 3.735175e+08
    1 Argentina 2 Tattoo Rauw Alejandro 1344382 202.887 Americas 2.727576e+08
    2 Argentina 3 Hola - Remix Dalex 1330011 249.520 Americas 3.318643e+08
    ... ... ... ... ... ... ... ... ...
    12397 South Africa 198 Black And White Niall Horan 11771 193.090 Africa 2.272862e+06
    12398 South Africa 199 When I See U Fantasia 11752 217.347 Africa 2.554262e+06
    12399 South Africa 200 Psycho! MASN 11743 197.217 Africa 2.315919e+06

    12400 rows × 8 columns

    Alternatively, you may want to add a new variable.

    For instance, suppose you want to know the total seconds people have listened to a track.

    In the data you have a tracks duration in seconds, and its number of streams. This means total time listened is duration multiplied by streams.

    You would use mutate in almost the same way. You pipe your music data to the mutate verb.

    The asterisk in Python represents multiplication, so you write streams asterisk duration to multiply the two columns.

    Using mutate to add a new variable (result)

    (music_top200
      >> mutate(ttl_stream_time = _.streams * _.duration)
    )
    

    country position track_name artist streams duration continent ttl_stream_time
    0 Argentina 1 Tusa KAROL G 1858666 200.960 Americas 3.735175e+08
    1 Argentina 2 Tattoo Rauw Alejandro 1344382 202.887 Americas 2.727576e+08
    2 Argentina 3 Hola - Remix Dalex 1330011 249.520 Americas 3.318643e+08
    ... ... ... ... ... ... ... ... ...
    12397 South Africa 198 Black And White Niall Horan 11771 193.090 Africa 2.272862e+06
    12398 South Africa 199 When I See U Fantasia 11752 217.347 Africa 2.554262e+06
    12399 South Africa 200 Psycho! MASN 11743 197.217 Africa 2.315919e+06

    12400 rows × 8 columns

    Notice that to be informative, we have named the column ttl_stream_time, that's what's to the left of the equals sign in our code.

    Column names need to be one word like this: you can't use spaces.

    Notice that in the results, there's a brand new ttl_stream_time column, which is much larger than streams.

    Answering a question

    (music_top200
      >> mutate(ttl_stream_time = _.streams * _.duration)
      >> filter(_.country == "Costa Rica")
      >> arrange(-_.ttl_stream_time)
    )

    Let's put together the three verbs you learned in this chapter to answer a question about our data.

    Suppose we wanted to know for a country, like Costa Rica, what songs were listened to longest.

    We would do this in three steps:

    • create the column
    • filter for country
    • then sort

    (Click the down arrow for a breakdown)

    Answering a question

    (music_top200
      >> mutate(ttl_stream_time = _.streams * _.duration)
    
    
    )

    country position track_name artist streams duration continent ttl_stream_time
    0 Argentina 1 Tusa KAROL G 1858666 200.960 Americas 3.735175e+08
    1 Argentina 2 Tattoo Rauw Alejandro 1344382 202.887 Americas 2.727576e+08
    2 Argentina 3 Hola - Remix Dalex 1330011 249.520 Americas 3.318643e+08
    ... ... ... ... ... ... ... ... ...
    12397 South Africa 198 Black And White Niall Horan 11771 193.090 Africa 2.272862e+06
    12398 South Africa 199 When I See U Fantasia 11752 217.347 Africa 2.554262e+06
    12399 South Africa 200 Psycho! MASN 11743 197.217 Africa 2.315919e+06

    12400 rows × 8 columns

    We use mutate to create the ttl_stream_time column.

    Answering a question

    (music_top200
      >> mutate(ttl_stream_time = _.streams * _.duration)
      >> filter(_.country == "Costa Rica")
    
    )

    country position track_name artist streams duration continent ttl_stream_time
    2200 Costa Rica 1 Safaera Bad Bunny 338078 295.177 Americas 9.979285e+07
    2201 Costa Rica 2 Si Veo a Tu Mamá Bad Bunny 244932 170.972 Americas 4.187651e+07
    2202 Costa Rica 3 Ignorantes Bad Bunny 233113 210.607 Americas 4.909523e+07
    ... ... ... ... ... ... ... ... ...
    2397 Costa Rica 198 Ride It Regard 21731 157.606 Americas 3.424936e+06
    2398 Costa Rica 199 Sola Manuel Turizo 21704 195.044 Americas 4.233235e+06
    2399 Costa Rica 200 Nena Maldición (feat. Lenny Tavárez) Paulo Londra 21684 228.875 Americas 4.962926e+06

    200 rows × 8 columns

    Then we'd use filter to restrict our analysis to only Costa Rica.

    Answering a question

    (music_top200
      >> mutate(ttl_stream_time = _.streams * _.duration)
      >> filter(_.country == "Costa Rica") 
      >> arrange(-_.ttl_stream_time) 
    )

    country position track_name artist streams duration continent ttl_stream_time
    2200 Costa Rica 1 Safaera Bad Bunny 338078 295.177 Americas 9.979285e+07
    2202 Costa Rica 3 Ignorantes Bad Bunny 233113 210.607 Americas 4.909523e+07
    2222 Costa Rica 23 René Residente 101872 457.592 Americas 4.661581e+07
    ... ... ... ... ... ... ... ... ...
    2377 Costa Rica 178 Dónde Estás KHEA 23177 153.560 Americas 3.559060e+06
    2394 Costa Rica 195 Blueberry Faygo Lil Mosey 21771 162.547 Americas 3.538811e+06
    2397 Costa Rica 198 Ride It Regard 21731 157.606 Americas 3.424936e+06

    200 rows × 8 columns

    Then we'd use arrange to sort the data in descending order of our new ttl_stream_time variable.

    Answering a question (result)

    (music_top200
      >> mutate(ttl_stream_time = _.streams * _.duration)
      >> filter(_.country == "Costa Rica") 
      >> arrange(-_.ttl_stream_time) 
    )

    country position track_name artist streams duration continent ttl_stream_time
    2200 Costa Rica 1 Safaera Bad Bunny 338078 295.177 Americas 9.979285e+07
    2202 Costa Rica 3 Ignorantes Bad Bunny 233113 210.607 Americas 4.909523e+07
    2222 Costa Rica 23 René Residente 101872 457.592 Americas 4.661581e+07
    ... ... ... ... ... ... ... ... ...
    2377 Costa Rica 178 Dónde Estás KHEA 23177 153.560 Americas 3.559060e+06
    2394 Costa Rica 195 Blueberry Faygo Lil Mosey 21771 162.547 Americas 3.538811e+06
    2397 Costa Rica 198 Ride It Regard 21731 157.606 Americas 3.424936e+06

    200 rows × 8 columns

    This gives us the answer we're looking for.

    The most streamed track in Costa Rica is still the first in our result.

    However, it looks like the track normally in the 23rd position--René by Residente--is third, since it's over 7 minutes long!

    Let's practice!

    Exercise 0:

    Delete the # symbol on the line # music_top200. This symbol is called a comment, and causes everything to the right of it to be ignored.

    Exercise 1:

    Currently, the duration is calculated in seconds.

    • Delete (uncomment) the # symbol in the code.
    • Complete the code to convert duration to milliseconds.
    hint > Delete the # symbol. This is called a comment. When it is the first non-blank on a line, it stops that line from being run. > 1 second is 1000 milliseconds. So you would multiply the duration by 1000.
    countrypositiontrack_nameartiststreamsdurationcontinent
    0Argentina1TusaKAROL G1858666200.960Americas
    1Argentina2TattooRauw Alejandro1344382202.887Americas
    2Argentina3Hola - RemixDalex1330011249.520Americas
    ........................
    12397South Africa198Black And WhiteNiall Horan11771193.090Africa
    12398South Africa199When I See UFantasia11752217.347Africa
    12399South Africa200Psycho!MASN11743197.217Africa

    12400 rows × 7 columns

    Exercise 2:

    Uncomment the mutate in the pipe below. Can you fix the two errors in it?

    It should return data with a new column on the right, that divides position by 200.

    countrypositiontrack_nameartiststreamsdurationcontinent
    0Argentina1TusaKAROL G1858666200.960Americas
    1Argentina2TattooRauw Alejandro1344382202.887Americas
    2Argentina3Hola - RemixDalex1330011249.520Americas
    ........................
    12397South Africa198Black And WhiteNiall Horan11771193.090Africa
    12398South Africa199When I See UFantasia11752217.347Africa
    12399South Africa200Psycho!MASN11743197.217Africa

    12400 rows × 7 columns

    Exercise 3:

    The mutate below uses code from a future chapter to calculate how many letters are in each artist's name.

    Which artist with the shortest name has the most streams?

    hint > You'll need to use an arrange in the pipe with 2 arguments.
    countrypositiontrack_nameartiststreamsdurationcontinentartist_length
    0Argentina1TusaKAROL G1858666200.960Americas7
    1Argentina2TattooRauw Alejandro1344382202.887Americas14
    2Argentina3Hola - RemixDalex1330011249.520Americas5
    ...........................
    12397South Africa198Black And WhiteNiall Horan11771193.090Africa11
    12398South Africa199When I See UFantasia11752217.347Africa8
    12399South Africa200Psycho!MASN11743197.217Africa4

    12400 rows × 8 columns

    Test yourself

    (click to answer)

    That's right!
    That's the right artist, but did you sort by most streams first?
    Incorrect. Did you sort by most streams first?
    Incorrect. Did you sort by most streams first?
    prev pagenext page