    Getting started

    Visualization with plotnine

    In the last chapter, you used the siuba package to answer some questions about the track_features dataset.

    You've been able to:

    • filter for particular observations.
    • arrange to find the highest or lowest values.
    • mutate to add new columns.

    However, most of the code so far has only printed out results as a table. Often a better way to understand and present this kind of data is as a graph.

    In this chapter, you'll learn the essential skill of data visualization using the plotnine package.

    Visualizing with plotnine

    [Figure: example scatterplot]

    In particular, this chapter will show you how to create scatterplots, like the one you see here.

    Scatterplots compare two variables on an x- and y- axis.

    Visualization and data wrangling are often intertwined, so you'll see how the siuba and plotnine packages work closely together to create informative graphs.

    Importing plotnine

    
    from siuba import *
    from plotnine import *
    

    You'll be creating plots using the plotnine package.

    Just like the siuba package, you'll have to load it with import.

    Variables

    billie = (
      track_features
      >> filter(_.artist == "Billie Eilish")
    )

    In this chapter, you'll sometimes be visualizing subsets of the track_features data. For example, this code gets only tracks by the artist Billie Eilish.

    When you're working with just that subset, sometimes it's useful to save the filtered data, as a new data frame.

    To do this, you use the assignment operator.

    Variables

    (
      track_features
      >> filter(_.artist == "Billie Eilish")
    )

    First, write out the pipe with the filter verb.

    This is the same as you did in the previous chapter.

    Note that the code works the same whether the name of the data, track_features, is right after the ( or on its own line.

    Variables

    
    billie = (
      track_features
      >> filter(_.artist == "Billie Eilish")
    )
    

    Then, write the variable name, followed by the assignment operator (an equal sign).

    In this case, the variable name is billie.

    Variables (result)

    billie

           artist         album           track_name      energy  valence  danceability  speechiness  acousticness  popularity  duration
    1273   Billie Eilish  dont smile ...  my boy          0.3940  0.3240   0.692         0.2070       0.472         44          170.852
    2899   Billie Eilish  WHEN WE ALL...  listen befo...  0.0561  0.0820   0.319         0.0450       0.935         79          242.652
    2950   Billie Eilish  lovely (wit...  lovely (wit...  0.2960  0.1200   0.351         0.0333       0.934         89          200.186
    ...    ...            ...             ...             ...     ...      ...           ...          ...           ...         ...
    24857  Billie Eilish  WHEN WE ALL...  ilomilo         0.4230  0.5720   0.855         0.0585       0.724         79          156.371
    24997  Billie Eilish  WHEN I WAS ...  WHEN I WAS ...  0.3320  0.0628   0.696         0.0425       0.853         71          270.520
    25147  Billie Eilish  come out an...  come out an...  0.3210  0.1770   0.640         0.0931       0.693         74          210.376

    27 rows × 10 columns

    Now if you print the billie variable, you can see that it's another table.

    But this one has only 27 rows: one for each track by Billie Eilish in the original data.

    Visualizing with plotnine

    (billie
     >> ggplot(aes("energy", "valence"))
      + geom_point()
      + labs(title = "Billie Eilish hit track features")
    )

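    [Figure: scatterplot titled "Billie Eilish hit track features", with energy on the x-axis and valence on the y-axis]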

    Suppose you want to examine the energy and valence of Billie Eilish's songs.

    You could do this with a scatterplot comparing two variables in our track_features dataset: energy on the x-axis and valence on the y-axis.

    There are three parts to a plotnine graph.

    First is the data that we're visualizing. In this case, that is the billie variable you just created.

    Second is the mapping of variables in your dataset to aesthetics in your graph.

    An aesthetic is a visual dimension of a graph that can be used to communicate information.

    In a scatterplot, your two dimensions are the x-axis and the y-axis, so you write aes (for "aesthetic") followed by parentheses containing "energy" and "valence". The first name is placed on the x-axis and the second on the y-axis.

    The third step is specifying the type of graph you're creating. You do that by adding a layer to the graph: write a plus sign after the ggplot call, and then geom_point().

    The "geom" means you're adding a type of geometric object to the graph, and the "point" indicates it's a scatterplot, where each observation corresponds to one point.

    Together, these three parts of the code--the data, the aesthetic mapping, and the layer--construct the scatter plot you see here.

    Let's practice!

    Exercise 1:

    In this exercise, there are two code cells. The first defines variables for tracks by different artists. The second creates a plot.

    Read through the code and plot, and then modify it to answer the question beneath.

           artist  album           track_name             energy  valence  danceability  speechiness  acousticness  popularity  duration
    1431   ITZY    IT'z Different  달라달라 (DALLA DALLA) 0.853   0.713    0.790         0.0665       0.00116       73          199.874
    21148  ITZY    IT'z Different  달라달라 DALLA DALLA   0.853   0.713    0.790         0.0665       0.00116       57          199.874
    22388  ITZY    It'z Me         WANNABE                0.911   0.640    0.809         0.0617       0.00795       81          191.242
    25287  ITZY    IT'z ICY        ICY                    0.904   0.814    0.801         0.0834       0.03240       72          191.142

    4 rows × 10 columns

    The code below plots hits for the roddy variable. Note that you could swap out roddy for either of the other two variables above.

    [Figure: plot of hit tracks for the roddy variable]

    Test yourself

    Who has the widest range of danceability? (i.e., the biggest difference between the highest and lowest values)

    Hint: all the ITZY songs shown have roughly the same danceability, so ITZY does not have the widest range.
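    The "range" in this question is the highest value minus the lowest value. As a quick numeric check alongside the plot, the sketch below computes it for the four ITZY danceability values listed in the table above; the variable names are made up for this example.

```python
# Danceability of the four ITZY tracks shown in the table above.
itzy_danceability = [0.790, 0.790, 0.809, 0.801]

# Range = highest value minus lowest value.
danceability_range = max(itzy_danceability) - min(itzy_danceability)

print(round(danceability_range, 3))  # 0.019
```

    A range this small matches what the plot shows: the ITZY points sit close together along the danceability axis.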

    Exercise 2:

    Does it look like there are any extremely popular songs over 15 minutes long?

    There is not one concrete answer to this question. Make a plot below, and come up with an answer you might share with another person.

    hint

    The duration column contains the length of each song in seconds. Use this with the popularity column.

    Exercise 3:

    Does the lowest energy track belong to a "low energy" artist? In this exercise, you'll explore this question using tracks by two artists.

    Here is the track data sorted by energy.

           artist          album             track_name                  energy    valence  danceability  speechiness  acousticness  popularity  duration
    1003   Simon Smith     Loops             Blagaslavlaju vas           0.000778  0.000    0.779         0.4210       0.99400       0           36.038
    5995   DMS             Prepáčte          Nič                         0.000791  0.000    0.571         0.4460       0.95000       25          37.355
    16689  Peter Simon     Snowrain          Snowrain                    0.003480  0.373    0.472         0.0517       0.99600       0           31.000
    ...    ...             ...               ...                         ...       ...      ...           ...          ...           ...         ...
    22695  Nino Xypolitas  Epireastika       Eime Enas Allos - Original  0.996000  0.517    0.644         0.1030       0.00346       34          214.693
    17072  Otira           Soundboy Burnin’  Soundboy Burnin’            0.997000  0.327    0.568         0.2330       0.00299       14          173.846
    11069  Scooter         No Time To Chill  How Much Is the Fish?       0.999000  0.615    0.533         0.0786       0.00130       48          226.200

    25321 rows × 10 columns

    Notice that Simon Smith has the lowest energy song ("Blagaslavlaju vas"), while Scooter has the highest energy song ("How Much is the Fish?").

    First, filter the track_features data to create a variable named artist_low that has only tracks by the artist Simon Smith.

    Next, create a variable named artist_high with tracks by the artist Scooter, who has the highest energy song.

    Based on separate plots of their data, does the artist with the lowest energy track seem to have lower energy songs in general?

    ⚠️: Don't forget to replace all the blanks!

    possible answer

    The high energy artist, Scooter, seems to have only high energy songs (from about 0.9 to 1 energy).

    On the other hand, the low energy artist, Simon Smith, seems to have a wide range of energy values (from about 0 to 1 energy).
