I have a question with regards to combining two data bases with a linking table:
Has anyone ever tried to combine the monthly stock return data table of CRSP with the yearly accounting data tables of Compustat through the CCM table? WRDS only offers a SAS solution, but unfortunately not one for R.
In detail: I have a table with all stock returns month-wise and their respective PERMNO-code (CRSP). I also have the yearly accounting data of these stocks with their GVKEY (Compustat). Obviously, the accounting data is the same for the company during the year and only changes every 12 months. Now I have the linking table which assigns each GVKEY to a PERMNO. I would like to combine them the following way:
Company | Month/Year | Return for this month | Accounting data columns | Stock Return columns for the prior 20 months
Does anyone have experience in this field?
Thanks in advance!
There are two ways of doing it. If you have ccm then use the link permno, if not use cusip. Just translate the sas code into r, it should be very straightforward.
I am a finance major. So how did you solve it 3 years ago. For me it says, that my data is too large I think (if I run the final merge command based on the PERMNO, the console gives back an error "Error: cannot allocate vector of size 317.9 Mb").
When you download data in CSV and text from wrds, and then read it into r, you may exceed the memory limit. What is your computer RAM, how many rows does your crsp data have? I use wrds cloud with r, so I do not have this issue. If you really want do it locally, use dplyr and data.table
A good introduction to seeking help from coding forums is to pose your questions as a [reproducible example (what many folks call a reprex for short)](FAQ: What's a reproducible example (`reprex`) and how do I do one?, the reprex-r-package is a handy tool to help make this easy).
In terms of tools for merging two datasets, I personally like the tidyverse's dplyr's two-table verbs for stuff like this.. But I'd be worried about suggesting this until we more clearly understand your desired outcome - since the time series model you want to apply later may require your data is structured in a particular way.
The author links the Compustat data with the CRSP data. He also uses the CCM LNKHIST data as you require. Unfortunately I don`t have access to the CCM data so I have to link the data through the GVKEY, CUSIP / some other method.
For my master thesis I am trying to combine the monthly stock return data of CRSP with the annual fundamental data of Compustat. I'm reading the R code as shared by msmith01 and can follow the logic of the code as it is clearly written. However, I have a question about the first section.
The annual data is retrieved from Compustat and the link table from CRSP/Compustat Merged. The link table is merged with the Compustat dataset. The combined data is merged with CRSP.
What is the advantage of retrieving the fundamental data from Compustat and combine that with the link table over directly retrieving the fundamental data from the CRSP/Compustat Merged dataset? As far as I can see the CRSP/Compustat Merged dataset includes the desired linking variables for merging with CRSP. Are there disadvantages with using the CRSP/Compustat Merged dataset?
The CRSP and Compustat have more data than the CCM data. Moreover, merging these two data by yourself allow you to choose data frequency more flexibly. For example, daily CRSP + quarterly Compustat, etc.
Thank you for your reply. I was not aware that the data of CCM was not that extensive as Compustat or CRSP.
Will use your code to merge the two datasets.
I am trying to run your code to see if I can reproduce the result. Unfortunately I am running into an error. I am trying to run the code as is (only change is username/password). The code I am running is from:
When I get to this section:
###############################################################################################################
### COMPUSTAT CLEANING AND VAR CALC ###
# load("180619 data.ccm.RData")
data.comp <- data.ccm %>%
rename(PERMNO=permno) %>% data.table %>% # ensure col names match crsp's
group_by(PERMNO) %>%
mutate(datadate = as.yearmon(datadate),
comp.count = row(.)) %>% # number of years in data; future option to cut first year data; works but leads to warnings
# tests based on BE spread show FF no longer impose this condition (even though mentioned in FF'93)
ungroup %>% arrange(datadate, PERMNO) %>% data.frame %>%
distinct(datadate, PERMNO, .keep_all = TRUE) # hasn't been issue but just in case
I get the following error:
Error: Column `comp.count` must be length 1 (the group size), not 12099238
In addition: There were 50 or more warnings (use warnings() to see the first 50)
When I have time, I will update the code using a pure data.table approach. Over time, I personally think data.table may be better at handling this kind of daily stock return data.