I am trying to learn text analysis using R in RStudio, so I am starting with a small data set.

I will share my data and the code I used. It seems my analysis is going wrong. First, load the necessary packages:
library(tidytext)
library(tidyverse)
library(stringr)

Add a custom lookup vector to recode a few sentences

Custome <- c(
"CA COLON METASTASIS TO BRAIN" = "brain_diseases",
"STOMACHACHE & HEADACHE" = "stomach_&headache",
"ACCIDENTAL FALL" = "accident",
"ACUTE RESPIRATORY SYNDROME" = "resporatoty_diseases",
"Advance Anal Melanoma Cancer" = "cancer",
"ARRHYTHMIA, SEPSIS AND METASTATIC THYMIC CANCER." = "cancer",
"BLOOD PRESSURE-ISSUED BY LOCAL GOVT AUTHORITY (GUP)" = "blood_pressure",
"CARCINOMA STOMACH" = "stomach&_headache",
"CARDIO RESPIRATORY FAILURE" = "resporatoty_diseases",
"CARDIOPULMONARY ARREST DUE TO BIKE ACCIDENT" = "accident",
"DUE TO ACCIDENTAL TOUCHING OF ELECTRIC FENCE" = "accident",
"FIRE ACCIDEN" = "accident",
"FLASH FLOOD" = "accident",
"GALL BLADDER CANCER" = "cancer",
"HEPATIC ENCEPHALOPATHY, DECOMPENSATED CHRONIC LIVER DISEAS" = "liver_diseases",
"ISCHAEMIC STROKE WITH HAEMORRHAGIC TRANSGORMATION" = "stroke",
"Massive intracranial hemorrhage leading to cardiorespiratory arrest due to motor vehicle accident" = "accident",
"MOTOR VEHICLE ACCIDENT" = "accident",
"MOTOR VEHICLE ACCIDENT" = "accident",
"MULTIPLE ORGAN FAILURE DUE TO NEROTOXIC SNAKE BITE" = "snake_bite",
"PRESSURE AND SWOLLEN BRAIN" = "blood_pressure",
"STROKE (HEART)" = "stroke",
"SUDDEN DEATH. AS PER MEDICAL REPORT, THE PROPOSER HAD HYPERTENSION AND HE WAS ON MEDICATION ON WHICH AN EXTRA PREMIUM OF NU. 2 PER 1000 WAS LOADED" ="unknown",
"CA COLON METASTASIS TO BRAIN" = "cancer",
"CA RECTUM" = "cancer",
"COMPLICATIONS OF SLE" = "unknown",
"HYPOXIC CARDIAC ARREST" = "heart_diseases",
"ILLNESS(SEPSIS)" = "sick",
"PULMONART HYPERTENSION" = "blood_pressure",
"Raised ICP" = "brain_diseases"
)

Carry out text cleaning

Cause_of_death <- Cause_of_death %>%
  mutate(
    `CAUSE OF DEATH` = str_to_lower(`CAUSE OF DEATH`), # convert to lower case
    `CAUSE OF DEATH` = str_replace_all(`CAUSE OF DEATH`, "[^[:alnum:] ]", ""), # remove anything that is not a letter, digit, or space (punctuation and other symbols)
    `CAUSE OF DEATH` = str_squish(`CAUSE OF DEATH`) # trim leading and trailing spaces and collapse repeated spaces
  )

Tokenization: break the text down into smaller units (words)

Cause_Tokenization <- Cause_of_death %>%
  unnest_tokens(word, `CAUSE OF DEATH`)

Count frequency

Cause_frequency <- Cause_Tokenization %>%
anti_join(stop_words, by = "word") %>%
count(word, sort = TRUE)

| | CAUSE OF DEATH | |
|---|---|---|
| 1 | ca colon metastasis to brain | NA |
| 2 | stomachache headache | NA |
| 3 | massive intracranial hemorrhage leading to cardiorespiratory arrest due to motor vehicle accident | NA |
| 4 | pulmonart hypertension | NA |
| 5 | pressure and swollen brain | NA |
| 6 | ischaemic stroke with haemorrhagic transgormation | NA |
| 7 | raised icp | NA |
| 8 | hepatic encephalopathy decompensated chronic liver disease | NA |
| 9 | acute respiratory syndrome | NA |
| 10 | unknown | NA |
| 11 | flash flood | NA |
| 12 | ca rectum | NA |
| 13 | motor vehicle accident | NA |
| 14 | sick | NA |
| 15 | due to accidental touching of electric fence | NA |
| 16 | gall bladder cancer | NA |
| 17 | unknown | NA |
| 18 | stroke heart | NA |
| 19 | advance anal melanoma cancer | NA |
| 20 | hypoxic cardiac arrest | NA |
| 21 | fire accident | NA |
| 22 | complications of sle | NA |
| 23 | blood pressure | NA |
| 24 | arrhythmia sepsis and metastatic thymic cancer | NA |
| 25 | multiple organ failure due to nerotoxic snake bite | NA |
| 26 | accidental fall | NA |
| 27 | blood pressure | NA |
| 28 | cardio respiratory failure | NA |
| 29 | motor vehicle accident | NA |
| 30 | cardiopulmonary arrest due to bike accident | NA |
| 31 | sudden death as per medical report the proposer had hypertension and he was on medication on which an extra premium of nu 2 per 1000 was loaded | NA |
| 32 | blood pressureissued by local govt authority gup | NA |
| 33 | stroke | NA |
| 34 | accident | NA |
| 35 | carcinoma stomach | NA |
| 36 | illnesssepsis | |


  1. :alnum: ↩︎

Can you explain a bit more about what you are doing? It appears that you have a data set called Cause_of_death, but without a sample of it, I don't see what we can do.

Paste your code here between

```

```

This gives us nicely formatted code.

A handy way to supply data is to use the dput() function. Do dput(mydata) where "mydata" is the name of your dataset. For really large datasets probably dput(head(mydata, 100)) will do. Paste the output between
```

```
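As a minimal sketch of what that looks like, assuming a small data frame with a single `CAUSE OF DEATH` column (the three rows below are made-up stand-ins, not your real data):

```r
# Hypothetical stand-in for the real data set; only the column name comes from the post
Cause_of_death <- data.frame(
  `CAUSE OF DEATH` = c("ACCIDENTAL FALL", "FLASH FLOOD", "CA RECTUM"),
  check.names = FALSE  # keep the space-containing column name as written
)

# Print R code that recreates the first two rows; paste this output into your post
sample_rows <- head(Cause_of_death, 2)
dput(sample_rows)
```

Anyone can then assign the pasted `structure(...)` output to a name and work with an exact copy of your rows.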

You may also find this helpful.
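One thing that may be worth checking, since the post does not show how `Custome` is applied: if the lookup happens after the cleaning step, the text is already lower case with punctuation stripped, while the names in `Custome` are still in their original form, so nothing matches and every lookup returns `NA`. Here is a rough sketch of that effect. It assumes the recode is a simple lookup by name, uses base R stand-ins for the stringr calls, and `clean()` is a hypothetical helper; the three entries are taken from the post.

```r
# Hypothetical helper mirroring the cleaning steps in the question (base R stand-ins)
clean <- function(x) {
  x <- tolower(x)                        # like str_to_lower()
  x <- gsub("[^[:alnum:] ]", "", x)      # drop everything except letters, digits, spaces
  trimws(gsub("\\s+", " ", x))           # like str_squish()
}

# Three entries copied from the Custome vector in the question
lookup <- c(
  "ACCIDENTAL FALL" = "accident",
  "STROKE (HEART)"  = "stroke",
  "Raised ICP"      = "brain_diseases"
)

# Raw values as they might appear in the column, then cleaned
cleaned <- clean(c("ACCIDENTAL FALL", "STROKE (HEART)", "Raised ICP"))

# Looking up cleaned text against the original names finds nothing: all NA
before <- unname(lookup[cleaned])

# Cleaning the names the same way makes the lookup match
names(lookup) <- clean(names(lookup))
after <- unname(lookup[cleaned])

print(before)
print(after)
```

If that is what is happening, either recode before cleaning or clean the names of the lookup vector with the same steps as the column.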