    Summarize

    Summarizing data

    In this chapter, you'll return to the topic of data transformation with siuba to learn more ways to explore your data.

    Data analysis

    Analyses usually involve a cycle between these steps of data transformation and visualization, along with other parts of the data science workflow, such as modeling (which this course won't focus on).

    Extracting data

    (music_top200
      >> filter(_.country == "Japan", _.position == 1)
    )

          country  position  track_name                 artist  streams  duration  continent
    6400    Japan         1   I LOVE...  Official HIGE DANdism  1591844   282.027       Asia

    You've learned to use the filter verb to pull out individual observations, such as the track in the first position for Japan.

    Notice that the track's duration is about 280 seconds. Is this shorter or longer than the average track?

    You'll learn to answer this question by summarizing many observations into a single data point.

    The summarize verb

    (music_top200
      >> summarize(avg_duration = _.duration.mean())
    )

    avg_duration
    0 205.990073

    You might want to know if the 280 seconds we saw is longer than the average track across all countries.

    You would do this with the summarize verb. Notice that summarize collapses the entire table down into one row.

    In the output, we see the answer to our question: the mean (or average) duration was about 206 seconds.

    Take your music_top200 data and pipe it into summarize.

    Then, specify that you're creating a summary column called avg_duration.

    Notice that the avg_duration = ... part of the code is similar to how you created a new column with the mutate verb.

    The expression _.duration.mean() is worth examining.

    This is calling the method mean on the variable duration. The mean method calculates the average of a set of values.
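
    As a cross-check on what this expression computes, here is a minimal sketch in plain pandas, the library whose DataFrames siuba operates on. The tracks table is made up for illustration (it is not music_top200), and the siuba pipeline appears only as a comment.

```python
import pandas as pd

# Hypothetical stand-in for music_top200 (made-up durations, in seconds).
tracks = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "duration": [180.0, 200.0, 220.0, 280.0],
})

# siuba version: tracks >> summarize(avg_duration = _.duration.mean())
# _.duration.mean() stands for "take the duration column and call its
# .mean() method", which pandas can do directly:
avg_duration = tracks["duration"].mean()

# Collapse the table to a single row, as summarize does.
summary = pd.DataFrame({"avg_duration": [avg_duration]})
print(summary)
```

    The four input rows become one output row holding a single number, which is the collapsing behavior described above.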

    Summarizing one country

    (music_top200
      >> filter(_.country == "Japan")
      >> summarize(avg_duration = _.duration.mean())
    )

    avg_duration
    0 250.53499

    If you think about it, it doesn't really make sense to summarize across all countries. It may make more sense to compare our track to the average in the same country: Japan.

    To answer this, you can combine the summarize verb with filter: filter your data for a particular country first, then summarize the result. This shows you that the average track duration in the Japan top 200 chart was about 250 seconds.
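
    The filter-then-summarize order can be sketched in plain pandas as follows. The table and its values are invented for illustration, so the numbers will not match the Japan chart above.

```python
import pandas as pd

# Hypothetical data: two countries, two tracks each.
tracks = pd.DataFrame({
    "country": ["Japan", "Japan", "Brazil", "Brazil"],
    "duration": [240.0, 260.0, 180.0, 200.0],
})

# siuba version:
#   tracks >> filter(_.country == "Japan")
#          >> summarize(avg_duration = _.duration.mean())
japan = tracks[tracks["country"] == "Japan"]   # step 1: keep only Japan's rows
japan_avg = japan["duration"].mean()           # step 2: average what is left

# For comparison, the average over every row in the table.
overall_avg = tracks["duration"].mean()

print(japan_avg, overall_avg)
```

    Because the filter runs first, the mean only ever sees Japan's rows, which is why the two averages differ.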

    Summarizing into multiple columns

    (music_top200
      >> filter(_.country == "Japan")
      >> summarize(
          avg_duration = _.duration.mean(),
          ttl_streams = _.streams.sum()
      )
    )

    avg_duration ttl_streams
    0 250.53499 48942067

    You can create multiple summaries at once with the summarize verb.

    For example, suppose that along with finding the average track duration in Japan, you want to find their chart's total streams.

    To do that, add a comma after the mean of the duration. Then, specify another column you're creating.

    You could give it a useful name like ttl_streams, and say that it's equal to the sum (another built-in method) of the streams variable.
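
    A rough pandas equivalent of a two-column summary is sketched below. The japan table here is a made-up three-row example, not the real chart, and the siuba pipeline is shown only in a comment.

```python
import pandas as pd

# Hypothetical, already-filtered rows for one country.
japan = pd.DataFrame({
    "duration": [240.0, 260.0, 250.0],
    "streams": [1000, 800, 600],
})

# siuba version:
#   japan >> summarize(
#       avg_duration = _.duration.mean(),
#       ttl_streams = _.streams.sum()
#   )
# Each named argument becomes one column of the one-row result.
summary = pd.DataFrame({
    "avg_duration": [japan["duration"].mean()],
    "ttl_streams": [japan["streams"].sum()],
})
print(summary)
```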

    Methods for summarizing

    E.g. _.some_column.mean()

    • .mean()
    • .sum()
    • .median()
    • .min()
    • .max()

    The mean and sum are just two methods you could use to summarize a variable within a dataset.

    Another example is median: the median represents the point in a set of numbers where half the numbers are above that point and half of the numbers are below.

    Two others are min, for minimum, and max, for maximum.
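
    To see how these five methods differ, here is a small sketch on a made-up pandas Series of durations. One value is deliberately much larger than the rest, to show how the mean and median can disagree.

```python
import pandas as pd

# Hypothetical durations in seconds; 400.0 is an intentional outlier.
durations = pd.Series([150.0, 190.0, 200.0, 210.0, 400.0])

avg = durations.mean()      # 230.0, pulled upward by the outlier
total = durations.sum()     # 1150.0
mid = durations.median()    # 200.0, the middle value once sorted
lowest = durations.min()    # 150.0
highest = durations.max()   # 400.0

print(avg, total, mid, lowest, highest)
```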

    In the exercises, you'll use several of these methods to answer questions about the music_top200 dataset.

    Let's practice!

    Exercise 1:

    The code below calculates the average duration.

    • Uncomment the summarize verb.
    • Change it to calculate median duration.
    • Make sure to change the resulting column name to indicate it's a median.

                    country  position       track_name          artist  streams  duration  continent
    0             Argentina         1             Tusa         KAROL G  1858666   200.960   Americas
    1             Argentina         2           Tattoo  Rauw Alejandro  1344382   202.887   Americas
    2             Argentina         3     Hola - Remix           Dalex  1330011   249.520   Americas
    ...                 ...       ...              ...             ...      ...       ...        ...
    12397      South Africa       198  Black And White     Niall Horan    11771   193.090     Africa
    12398      South Africa       199     When I See U        Fantasia    11752   217.347     Africa
    12399      South Africa       200          Psycho!            MASN    11743   197.217     Africa

    12400 rows × 7 columns

    Q: What is the median duration?

    Answer: 201.084

    Q: Add a second argument to summarize, which calculates the sum of streams. How large is it?

    Answer: 3,018,225,255

    Exercise 2:

    Use verbs you learned in chapter 1 to do the following:

    • find the track with the lowest duration
    • subset the data to keep only the row for that track

    (Note, you may need to run code multiple times)

                    country  position       track_name          artist  streams  duration  continent
    0             Argentina         1             Tusa         KAROL G  1858666   200.960   Americas
    1             Argentina         2           Tattoo  Rauw Alejandro  1344382   202.887   Americas
    2             Argentina         3     Hola - Remix           Dalex  1330011   249.520   Americas
    ...                 ...       ...              ...             ...      ...       ...        ...
    12397      South Africa       198  Black And White     Niall Horan    11771   193.090     Africa
    12398      South Africa       199     When I See U        Fantasia    11752   217.347     Africa
    12399      South Africa       200          Psycho!            MASN    11743   197.217     Africa

    12400 rows × 7 columns

    Now, use summarize to calculate the min and max duration directly.

                    country  position       track_name          artist  streams  duration  continent
    0             Argentina         1             Tusa         KAROL G  1858666   200.960   Americas
    1             Argentina         2           Tattoo  Rauw Alejandro  1344382   202.887   Americas
    2             Argentina         3     Hola - Remix           Dalex  1330011   249.520   Americas
    ...                 ...       ...              ...             ...      ...       ...        ...
    12397      South Africa       198  Black And White     Niall Horan    11771   193.090     Africa
    12398      South Africa       199     When I See U        Fantasia    11752   217.347     Africa
    12399      South Africa       200          Psycho!            MASN    11743   197.217     Africa

    12400 rows × 7 columns
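
    The two approaches in this exercise can be compared side by side in a small pandas sketch. The table below is invented for illustration, and this is one possible way to express each approach, not the exercise's reference solution.

```python
import pandas as pd

# Hypothetical tracks with made-up durations in seconds.
tracks = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "duration": [180.0, 95.5, 220.0, 310.0],
})

# Arrange-style approach: sort by duration, then keep the first row.
# This returns the whole row, including the track name.
shortest_row = tracks.sort_values("duration").head(1)

# Summarize-style approach: ask for the extremes directly.
# siuba version:
#   tracks >> summarize(
#       min_duration = _.duration.min(),
#       max_duration = _.duration.max()
#   )
summary = pd.DataFrame({
    "min_duration": [tracks["duration"].min()],
    "max_duration": [tracks["duration"].max()],
})

print(shortest_row)
print(summary)
```

    The sorted approach keeps every column of one row, while the summary keeps only the new columns but can report values that come from different rows.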

    Test yourself

    Why would you use summarize like this, rather than the arrange and filter approach?

    Answer: filter keeps specific rows, but summarize can collect values from across rows.

    Note that filter keeps all the variables (columns), while summarize removes most of them.

    Exercise 3:

    The examples below show what happens when verbs like filter and mutate use methods like .mean().

                    country  position       track_name          artist  streams  duration  continent  avg_streams
    0             Argentina         1             Tusa         KAROL G  1858666   200.960   Americas  243405.2625
    1             Argentina         2           Tattoo  Rauw Alejandro  1344382   202.887   Americas  243405.2625
    2             Argentina         3     Hola - Remix           Dalex  1330011   249.520   Americas  243405.2625
    ...                 ...       ...              ...             ...      ...       ...        ...          ...
    12397      South Africa       198  Black And White     Niall Horan    11771   193.090     Africa  243405.2625
    12398      South Africa       199     When I See U        Fantasia    11752   217.347     Africa  243405.2625
    12399      South Africa       200          Psycho!            MASN    11743   197.217     Africa  243405.2625

    12400 rows × 8 columns

                    country  position       track_name             artist  streams  duration  continent
    108           Argentina       109         Me Gusta  Ciro y los Persas   243159   289.093   Americas
    109           Argentina       110          Tal Vez       Paulo Londra   242870   264.483   Americas
    110           Argentina       111         Physical           Dua Lipa   239225   193.829   Americas
    ...                 ...       ...              ...                ...      ...       ...        ...
    12397      South Africa       198  Black And White        Niall Horan    11771   193.090     Africa
    12398      South Africa       199     When I See U           Fantasia    11752   217.347     Africa
    12399      South Africa       200          Psycho!               MASN    11743   197.217     Africa

    9341 rows × 7 columns
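
    The code that produced the two outputs above is not shown here. They are consistent with a mutate that stores the mean of streams in avg_streams and a filter that keeps rows with streams below that mean, so the sketch below assumes those two pipelines. It uses plain pandas on a made-up table, with the assumed siuba calls in comments.

```python
import pandas as pd

# Hypothetical tracks with made-up stream counts.
tracks = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "streams": [100, 200, 300, 800],
})

# mutate-style: the single mean value is repeated on every row.
# Assumed siuba version: tracks >> mutate(avg_streams = _.streams.mean())
with_avg = tracks.assign(avg_streams=tracks["streams"].mean())

# filter-style: the mean is computed once, then each row is compared to it.
# Assumed siuba version: tracks >> filter(_.streams < _.streams.mean())
below_avg = tracks[tracks["streams"] < tracks["streams"].mean()]

print(with_avg)
print(below_avg)
```

    In both cases the summary method still returns one number. The verb decides what happens to it: mutate copies it into every row, and filter uses it as a threshold.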

    Based on the examples above, can you use only the filter verb to get the most streamed song in all the data?

                    country  position       track_name          artist  streams  duration  continent
    0             Argentina         1             Tusa         KAROL G  1858666   200.960   Americas
    1             Argentina         2           Tattoo  Rauw Alejandro  1344382   202.887   Americas
    2             Argentina         3     Hola - Remix           Dalex  1330011   249.520   Americas
    ...                 ...       ...              ...             ...      ...       ...        ...
    12397      South Africa       198  Black And White     Niall Horan    11771   193.090     Africa
    12398      South Africa       199     When I See U        Fantasia    11752   217.347     Africa
    12399      South Africa       200          Psycho!            MASN    11743   197.217     Africa

    12400 rows × 7 columns

                country  position  track_name       artist   streams  duration  continent
    7800  United States         1     The Box  Roddy Ricch  12987027   196.653   Americas

    1 rows × 7 columns

    Q: What is the most streamed song?

    Answer: The Box by Roddy Ricch