New in R and could use some help with fixing my CSV file

SawSonex · October 21, 2021, 11:42am

Hey,
I'm fairly new to R and I'm practicing some assignments for my class and I'm struggling with importing and editing my CSV file. I should also note I have to do this without modifying the files outside of R.

I use the read.csv function and then remove the rows I don't need.
All my data including dates end up in 1 column with " , " as separator but I would like to split them into their own columns.

example of 2 rows in my column:" 192607, 1.0874, 0.9349, -0.0695, 5.7168, 1.9620, 1.4222 "
" 192608, 0.7030, 1.2300, 5.3842, 2.7154, 2.6930, 6.3154 "

Here the date comes first and is followed by the 6 pieces of data that are separated with ", " .

I want to have a column for each piece of data and the date. Any help is greatly appreciated!

I tried using the stringr package and the str_split_fixed function but I couldn't seem to get it to work.

My code:

import data from CSV file

data <- read.csv("./6_Portfolios_2x3.CSV", sep=" ; ")

remove excess rows

data <- data[-c(1152:8478),]

nirgrahamuk · October 21, 2021, 12:05pm

if its a true csv, which has comma seperated values, then using read.csv with sep parameter set to semicolon sep=" ; " is a definite mistake.
the default sep="," should be used.

data <- read.csv("./6_Portfolios_2x3.CSV")

because its the default for the function it can be omitted.

SawSonex · October 21, 2021, 12:14pm

Oh yea sorry should have mentioned. semicolon here is definitely closer to what I want. That way all my data is correctly placed in corresponding rows, although still only in 1 column, while using the default ", " seperator means I only get one piece of data each row. So instead of one row with all data with the corresponding date like:

" 192607, 1.0874, 0.9349, -0.0695, 5.7168, 1.9620, 1.4222 "
I'd get:
192607
1.0874
0.9349
-0.0695
5.7168
1.9620
1.4222

Still only 1 column though.

nirgrahamuk · October 21, 2021, 12:22pm

Can you do this for me and share the results.

readLines(con=file("./6_Portfolios_2x3.CSV"),n = 6)

side note for forum use : to format results in a nice monospace format, use the backtick symbol on your keyboard three times on its own row before the text to format.

```
info to share in monospace

SawSonex · October 21, 2021, 1:10pm

Here are my results:

> readLines(con=file("./6_Portfolios_2x3.CSV"),n = 6)
[1] "This file was created by CMPT_ME_BEME_OP_INV_RETS using the 202108 CRSP database."                                            
[2] "It contains value- and equal-weighted returns for portfolios formed on ME and BEME."                                          
[3] ""                                                                                                                             
[4] "The portfolios are constructed at the end of June.  BEME is book value at the last fiscal year end of the prior calendar year"
[5] "divided by ME at the end of December of the prior year. "                                                                     
[6] "Annual returns are from January to December."   

Thanks for the forum use tip!

nirgrahamuk · October 21, 2021, 1:17pm

So this is clearly plain text as far as it goes. presumably at some point there is delimited values ? but you should identify on what line that begins. I have here showed the first 6 rows. (n=6)

increase n to the point where it shows your values of interest ideally there are comma seperated heading labels on one row and then rows below are the values, if they are seperated with "," then you will be using that, if they are seperated by another seperator ";" unlikely but possible, then use that.

you can hopefully just use read.csv with an appropriate row skip param.
like if your headings starts on line 23 and values on line 24 then

read.csv("./6_Portfolios_2x3.CSV",skip = 23)

SawSonex · October 21, 2021, 1:26pm

Huh weird. I've already removed the data I don't need and I've removed the unnecessary text at the top still that command gives me that text as output. Could it be reading from the unmodified file in the working directory perhaps?

Here's a screenshot of how my data looks. From 2 : 1143 (last line) there's only data. The text in row 1 should be the column names and are also separated by " , " .

SawSonex · October 21, 2021, 1:37pm

Using stringr and the str_split_fixed function looks like it works but it only shows the results in the console, the data itself doesn't change.

 str_split_fixed(data, ",", 7)
        [,1]     [,2]         [,3]        [,4]         [,5]        [,6]        [,7]       
   [1,] ""       "SMALL LoBM" "ME1 BM2"   "SMALL HiBM" "BIG LoBM"  "ME2 BM2"   "BIG HiBM" 
   [2,] "192607" "   1.0874"  "   0.9349" "  -0.0695"  "   5.7168" "   1.9620" "   1.4222"
   [3,] "192608" "   0.7030"  "   1.2300" "   5.3842"  "   2.7154" "   2.6930" "   6.3154"
   [4,] "192609" "  -2.9117"  "  -0.1303" "  -0.4374"  "   1.4287" "   0.0704" "  -0.7967"
   [5,] "192610" "  -3.8196"  "  -4.5860" "  -2.0112"  "  -3.5898" "  -2.3398" "  -4.0970"
   [6,] "192611" "   3.1806"  "   3.7233" "   2.0944"  "   3.1292" "   2.8952" "   3.4614"
   [7,] "192612" "   2.6862"  "   1.9547" "   3.2639"  "   2.9678" "   2.6016" "   2.3950"
   [8,] "192701" "  -0.7991"  "  -0.4209" "   3.9316"  "  -0.2621" "   0.0358" "   4.4635"
   [9,] "192702" "   3.7129"  "   6.1231" "   6.6192"  "   5.0044" "   3.5692" "   8.6305"
  [10,] "192703" "  -1.8324"  "  -1.6153" "  -3.1676"  "   1.3025" "   0.2287" "  -2.4764"

nirgrahamuk · October 21, 2021, 1:46pm

Perhaps be explicit. What path and filename is your original data ?
What path and filename is your stripped away data ?

P.s. what method stripping the data had you used? If not the skip?

SawSonex · October 21, 2021, 2:52pm

I finally got it to work with stringr

data <- str_split_fixed(data, "," , 7)

I completely forgot that if you don't reassign the data R just prints it. Thanks a bunch for helping out !

system · October 28, 2021, 2:52pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.