Joining Data Sets

JReezy · November 1, 2020, 1:17am

I have two data sets that I merged using the dplyr::left_join function. The columns with common names joined properly. However, the last column, a numerical column whose name is not shared between the two data sets, had all of its values turned into NAs. How can I keep my numerical values while joining the two data sets?

andresrcs · November 1, 2020, 1:24am

Can you please share a small part of the data sets in a copy-paste friendly format?

In case you don't know how to do it, there are many options, which include:

If you have stored the data set in some R object, the dput() function is very handy.
In case the data set is in a spreadsheet, check out the datapasta package. Take a look at this link.
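
As a rough sketch of the dput() route: the function prints R code that rebuilds an object, so its output can be pasted straight into a post. The data frame name stats_sample and its values are made up for illustration and are not taken from the data in this thread.

```r
# Hypothetical sample data; replace with a small slice of your own data frame
stats_sample <- data.frame(
  Players = c("Player A", "Player B", "Player C"),
  Blocks_Per_Game = c(1.5, 0.75, 2.25),
  stringsAsFactors = FALSE
)

# Print a copy/paste friendly definition of the first two rows
dput(head(stats_sample, 2))

# The printed structure(...) call is ordinary R code, so evaluating it
# recreates the original object
rebuilt <- eval(parse(text = capture.output(dput(stats_sample))))
identical(rebuilt, stats_sample)
```

Pasting the structure(...) text that dput() prints, assigned to a name, is usually enough for others to recreate the data on their own machine.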

JReezy · November 1, 2020, 1:58am

Unfortunately, I'm not following how to do the last steps in the video. I'm assuming that what I have pasted below will not work:

Blocks_Per_Game Turnovers_Per_Game Offensive_Rating Defensive_Rating Salary
1 1.06 1.51 122.0 101.9 NA
2 1.29 2.82 116.2 102.2 NA
3 1.64 1.40 114.7 109.1 NA
4 1.31 1.10 131.3 101.0 NA
5 1.05 3.65 116.1 90.2 NA
6 0.47 1.72 103.2 109.0 NA

JReezy · November 1, 2020, 1:59am

I can see why it wouldn't. Those NAs are supposed to be under Salary, for example.

andresrcs · November 1, 2020, 2:02am

I think it would be better if you could prepare a reproducible example (reprex) illustrating your issue. Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example (reprex) for beginners

A minimal reproducible example consists of the following items:

A minimal dataset, necessary to reproduce the issue
The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages.

Let's quickly go over each one of these with examples:

Minimal Dataset (Sample Data)

You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame

head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.…

JReezy · November 1, 2020, 2:23am

head(nba_salaries2, 10)[,c("Players", "Salary", "Rank")]
#> Error in head(nba_salaries2, 10): object 'nba_salaries2' not found
datapasta::df_paste(head(nba_salaries2, 10)[,c("Players", "Salary", "Rank")])
#> Error in head(nba_salaries2, 10): object 'nba_salaries2' not found
nba_reg <- data.frame(
  stringsAsFactors = FALSE,
                      Players = c("Stephen Curry","Chris Paul","Russell Westbrook",
                                  "John Wall","James Harden","LeBron James",
                                  "Kevin Durant","Blake Griffin","Kyle Lowry",
                                  "Paul George"),
                       Salary = c(40231758,
                                  38506482,38178000,37800000,37800000,
                                  37436858,37199000,34234964,33296296,33005556),
              Rank = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
           )

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
nba_stats4 <- dplyr::left_join(nba_stats3, nba_salaries2)
#> Error in dplyr::left_join(nba_stats3, nba_salaries2): object 'nba_stats3' not found

Created on 2020-10-31 by the reprex package (v0.3.0)

JReezy · November 1, 2020, 2:25am

Did it work?? I'm feeling pretty good about learning something new here

andresrcs · November 1, 2020, 2:27am

In order to make your example reproducible, you have to provide sample data for nba_stats3 and nba_salaries2

JReezy · November 1, 2020, 2:31am

So, should I do a reprex for nba_stats3 now? Was the last bit of data sufficient for what is needed for nba_salaries2? Or was that something entirely different from a sample of nba_salaries2?

andresrcs · November 1, 2020, 2:35am

Please read the guide I gave you more carefully. You need to provide sample data (in a copy/paste friendly format) that allows us to run your code on our own, see what is going on, and give you a solution.

JReezy · November 1, 2020, 3:33am

I'm unable to fix the error messages that appear when I try to render the reprex. I'm not sure what else to try, as I may be making a bigger mess than the one I'm trying to clean up.

andresrcs · November 1, 2020, 2:15pm

This is a reproducible example of a left_join(), which means you can simply copy the code as it is and run it on your computer. Try to make one that shows your issue.

library(dplyr)

# Sample data in a copy/paste friendly format
nba_salary <- data.frame(
    stringsAsFactors = FALSE,
    Players = c("Stephen Curry","Chris Paul","Russell Westbrook",
                "John Wall","James Harden","LeBron James",
                "Kevin Durant","Blake Griffin","Kyle Lowry",
                "Paul George"),
    Salary = c(40231758,
               38506482,38178000,37800000,37800000,
               37436858,37199000,34234964,33296296,33005556),
    Rank = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

nba_stats <- data.frame(
    stringsAsFactors = FALSE,
    Players = c("Stephen Curry","Chris Paul","Russell Westbrook",
                "John Wall","James Harden","LeBron James",
                "Kevin Durant","Blake Griffin","Kyle Lowry",
                "Paul George"),
    some_stat = rnorm(10)
)

# Relevant code
nba_stats %>% 
    left_join(nba_salary, by = "Players")
#>              Players   some_stat   Salary Rank
#> 1      Stephen Curry  0.01482341 40231758    1
#> 2         Chris Paul -0.18417913 38506482    2
#> 3  Russell Westbrook -0.42616613 38178000    3
#> 4          John Wall -0.63422415 37800000    4
#> 5       James Harden  1.34669284 37800000    5
#> 6       LeBron James -0.65364381 37436858    6
#> 7       Kevin Durant -0.69911135 37199000    7
#> 8      Blake Griffin  0.75686720 34234964    8
#> 9         Kyle Lowry -1.34995016 33296296    9
#> 10       Paul George  0.59446418 33005556   10

Created on 2020-11-01 by the reprex package (v0.3.0.9001)
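
If a join like this one still leaves NA in some rows, a common cause is that the key values do not match exactly between the two data frames, or that a key is simply absent from the right-hand table. The following is only a sketch with made-up players and salaries (the names stats, salary and joined are hypothetical), showing how anti_join() can list the rows that have no match:

```r
library(dplyr)

# Hypothetical data: one name has a trailing space, one player has no salary row
stats <- data.frame(
  Players = c("Player A", "Player B ", "Player C"),
  some_stat = c(1.1, 2.2, 3.3),
  stringsAsFactors = FALSE
)
salary <- data.frame(
  Players = c("Player A", "Player B"),
  Salary = c(100, 200),
  stringsAsFactors = FALSE
)

# Rows of stats whose key has no match in salary;
# these are the rows that get NA after left_join()
anti_join(stats, salary, by = "Players")

# Cleaning the key before joining (here: trimming whitespace) fixes the
# near-miss, while a player absent from salary still gets NA
joined <- stats %>%
  mutate(Players = trimws(Players)) %>%
  left_join(salary, by = "Players")
joined
```

Whether trimming whitespace is the right clean-up depends on how the names actually differ in the real data; anti_join() is mainly a way to see which keys to look at.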

JReezy · November 1, 2020, 8:17pm

OK, I was able to get the majority of the salaries joined to my stats data frame using the code you posted. A few salaries still came through as NA, but that's not a problem for my project. I'm still not entirely sure how to recreate my example in a reproducible format to post here, but I will go back through the steps and try again.
