
    Course welcome

    Introduction to siuba

    Hi, I'm Michael--the creator of siuba, a library for data analysis in Python.

    In this course, you'll use Python to explore and visualize data.

    The course is designed for people who have never programmed before, and anyone interested in siuba!

    Data Analysis

    Data analysis is the act of using data to produce effective communication: communication that leads to meaningful action.

    In this course, we'll focus on one area of data analysis where siuba shines: transforming data.

    We'll also combine siuba with a library called plotnine to visualize data. Together, these libraries make a powerful combo.

    Meet the data: Spotify top 200

    We'll use data that Spotify publishes every week on the 200 most streamed songs, in 62 different countries.

    You can check it out on spotifycharts.com.

    Meet the data: Spotify top 200

    music_top200

           country       position  track_name       artist          streams  duration  continent
    0      Argentina     1         Tusa             KAROL G         1858666  200.960   Americas
    1      Argentina     2         Tattoo           Rauw Alejandro  1344382  202.887   Americas
    2      Argentina     3         Hola - Remix     Dalex           1330011  249.520   Americas
    ...    ...           ...       ...              ...             ...      ...       ...
    12397  South Africa  198       Black And White  Niall Horan     11771    193.090   Africa
    12398  South Africa  199       When I See U     Fantasia        11752    217.347   Africa
    12399  South Africa  200       Psycho!          MASN            11743    197.217   Africa

    12400 rows × 7 columns

    The data we'll use is held in a DataFrame. A DataFrame is a big table of data, made up of rows and columns. In the example above, the variable called music_top200 lets us refer to and work on the data.

    Notice that the bottom-left of the table shows the number of rows and columns. This data has 12,400 rows and 7 columns.
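
    Here is a rough sketch of the same idea in code, assuming pandas is installed (siuba works on pandas DataFrames). It builds a tiny DataFrame by hand and asks for its shape. The name music_sample is made up for illustration, and its three rows are copied from the top of the table above; the real music_top200 has far more rows.

```python
import pandas as pd

# Hypothetical stand-in for music_top200: three rows copied from the table.
music_sample = pd.DataFrame({
    "country": ["Argentina", "Argentina", "Argentina"],
    "position": [1, 2, 3],
    "track_name": ["Tusa", "Tattoo", "Hola - Remix"],
    "artist": ["KAROL G", "Rauw Alejandro", "Dalex"],
    "streams": [1858666, 1344382, 1330011],
    "duration": [200.960, 202.887, 249.520],
    "continent": ["Americas", "Americas", "Americas"],
})

# .shape is a pair: (number of rows, number of columns)
n_rows, n_cols = music_sample.shape
print(n_rows, "rows x", n_cols, "columns")  # prints: 3 rows x 7 columns
```

    Each key in the dictionary becomes a column, and each position in the lists becomes a row.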

    Meet the data: Spotify top 200

    music_top200
           country       position  track_name       artist          streams  duration  continent
    0      Argentina     1         Tusa             KAROL G         1858666  200.960   Americas
    1      Argentina     2         Tattoo           Rauw Alejandro  1344382  202.887   Americas
    2      Argentina     3         Hola - Remix     Dalex           1330011  249.520   Americas
    ...    ...           ...       ...              ...             ...      ...       ...
    12397  South Africa  198       Black And White  Niall Horan     11771    193.090   Africa
    12398  South Africa  199       When I See U     Fantasia        11752    217.347   Africa
    12399  South Africa  200       Psycho!          MASN            11743    197.217   Africa

    Every observation--or row--in the DataFrame is a track in the top 200 for a country.

    For example, the second row (index 1) is the track in position 2 (second most streamed) in Argentina. The track is named Tattoo, has been streamed 1,344,382 times, and is about 203 seconds long!

    Meet the data: Spotify song features

    Later in the course, we'll look at measures Spotify calculates for each song.

    Their measures for a song include:

    • energy
    • valence (how happy or positive)
    • danceability
    • speechiness
    • acousticness

    Data Analysis

    The skills you'll build doing transformation and visualization in this course will allow you to start analyzing your own data.

    The course is interactive: between short lessons you'll complete exercises by typing your own code.

    How code is structured

    (track_features
      >> filter(_.artist == "The Weeknd")
      >> ggplot(aes("energy", "valence"))
       + geom_point()
    )

    [Plot: scatter plot of energy (x-axis) against valence (y-axis) for tracks by The Weeknd]

    Here's an example of plotting the energy and valence of tracks by The Weeknd. Each point is a single track (for example, the song "Blinding Lights").

    The top line of the code is the data. Every additional line is an action applied to the data.

    Click down to see what each line of code does.

    How code is structured

    (track_features
      >> filter(_.artist == "The Weeknd")
    )

           artist      album                      track_name                            energy  valence  danceability  speechiness  acousticness  popularity  duration
    568    The Weeknd  My Dear Melancholy,        Call Out My Name                      0.593   0.175    0.461         0.0356       0.17000       82          228.373
    2753   The Weeknd  Blinding Lights            Blinding Lights                       0.796   0.345    0.513         0.0629       0.00147       75          201.573
    3004   The Weeknd  In Your Eyes (Remix)       In Your Eyes (with Doja Cat) - Remix  0.731   0.727    0.679         0.0319       0.00518       81          237.912
    ...    ...         ...                        ...                                   ...     ...      ...           ...          ...           ...         ...
    23966  The Weeknd  Beauty Behind The Madness  The Hills                             0.564   0.137    0.585         0.0515       0.06710       83          242.253
    24688  The Weeknd  Starboy                    Starboy                               0.587   0.486    0.679         0.2760       0.14100       84          230.453
    24982  The Weeknd  After Hours                In Your Eyes                          0.719   0.717    0.667         0.0346       0.00285       91          237.520

    23 rows × 10 columns

    The first action is filtering the data, to keep only observations (rows) where artist is The Weeknd.

    Don't worry too much about the details for now. Filter is the first thing you'll learn once you start the first chapter!

    How code is structured

    (track_features
      >> filter(_.artist == "The Weeknd")
      >> ggplot(aes("energy", "valence"))
    )

    [Plot: axes for energy (x-axis) and valence (y-axis), before any points are added]

    The next action, ggplot(...), gets ready to make a plot, based on the data in the previous step.

    How code is structured

    (track_features
      >> filter(_.artist == "The Weeknd")
      >> ggplot(aes("energy", "valence"))
       + geom_point()
    )

    [Plot: scatter plot of energy (x-axis) against valence (y-axis) for tracks by The Weeknd]

    Finally, geom_point() adds points to the plot. Each point comes from a row of data.

    So in this case each point is a track by The Weeknd.

    Let's practice!

    For the practice below, you'll get to explore the two datasets by testing out different options.

    This is to make sure you get the big picture, before we dive into the specifics of how each piece of code works!

    Exercise 1: inspecting music data

    Use the dropdown box below to change the code. Try choosing "United States" from the dropdown, then click run. This should return only the top 200 hits from the United States.

          country        position  track_name              artist        streams   duration  continent
    7800  United States  1         The Box                 Roddy Ricch   12987027  196.653   Americas
    7801  United States  2         Myron                   Lil Uzi Vert  9163134   224.955   Americas
    7802  United States  3         Blueberry Faygo         Lil Mosey     8043475   162.547   Americas
    ...   ...            ...       ...                     ...           ...       ...       ...
    7997  United States  198       Lights Up               Harry Styles  1606234   172.227   Americas
    7998  United States  199       Without Me              Halsey        1606153   201.661   Americas
    7999  United States  200       Enemies (feat. DaBaby)  Post Malone   1597824   196.760   Americas

    Test yourself

    Which artist has a track in the second position on the United States charts?

    Try again. That artist is in the first position.
    That's right!
    Try again. That artist is the second from last position (198).

    Exercise 2: inspecting track_features data

    Use the options below to examine tracks by different artists. Can you find the options that order tracks from highest energy to lowest?

           artist      album                 track_name                      energy  valence  danceability  speechiness  acousticness  popularity  duration
    22431  The Weeknd  After Hours (Deluxe)  Missed You - Bonus Track        0.364   0.4480   0.716         0.0866       0.10700       48          144.540
    3889   The Weeknd  After Hours (Deluxe)  Nothing Compares - Bonus Track  0.577   0.0398   0.524         0.0358       0.00253       49          222.307
    17384  The Weeknd  Heartless             Heartless                       0.750   0.1980   0.531         0.1110       0.00632       60          200.080
    ...    ...         ...                   ...                             ...     ...      ...           ...          ...           ...         ...
    9284   The Weeknd  After Hours           After Hours                     0.572   0.1430   0.664         0.0305       0.08110       84          361.027
    24689  The Weeknd  Starboy               Starboy                         0.587   0.4860   0.679         0.2760       0.14100       84          230.453
    24983  The Weeknd  After Hours           In Your Eyes                    0.719   0.7170   0.667         0.0346       0.00285       91          237.520

    Exercise 3: Plotting track features

    Here is one kind of plot you will learn to make in the course.

    Once you've tried some options, scroll down and click next to get started with the first chapter!

