Handling non-English data

rotem · December 3, 2023, 5:37am

Hi, I am trying to analyse food prices from several countries, including Israel. The problem is that the data is given in Hebrew, and translating it in Excel makes the work a bit clumsy.
Is there a way to translate everything in R using tokens, or some kind of dictionary, which I have already prepared?

Thank you!

Here is part of my data. price is the last original variable; the columns after it are my translation.

data <- data.frame(
  date = as.Date(c("2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01")),
  id = c(81, 83, 86, 90, 95, 13, 14, 16, 18, 19, 22, 139, 24, 25, 26, 27, 29, 30),
  sector_heb = c("הדרים", "הדרים", "הדרים", "הדרים", "הדרים", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות"),
  title_heb = c("אשכולית", "לימון", "פומלית", "קליפים (קלמנטינה)", "תפוז", "בטטה", "בטטה", "בצל", "בצל", "בצל", "בצל", "גויבות", "גזר", "גזר", "דלורית", "דלעת", "חסה", "חציל"),
  subtitle_heb = c("אדומות", NA, NA, "מיכל", "ולנסיה", NA, NA, "אדום", "בית-אלפא", "יבש", NA, NA, NA, NA, NA, NA, "ערבית", "חממה"),
  spec_heb = c(NA, "קוטר 7", NA, NA, NA, NA, NA, NA, "קוטר 45-77", "יבוא", NA, NA, "ארוז", "תפזורת", NA, "10-25 ק\"ג", "יח' 1", NA),
  quality_heb = c("סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "מובחר", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א"),
  price = c(3.4, 5.1, 3.0, 3.5, 4.9, 2.7, 5.1, 3.0, 2.9, 2.3, 2.5, 7.0, 4.5, 4.0, 2.5, 2.5, 4.0, 3.0),
  sector = c("citrus", "citrus", "citrus", "citrus", "citrus", "vegetables", "vegetables", "vegetables", "vegetables", "vegetables", "vegetables", "fruit", "vegetables", "vegetables", "vegetables", "vegetables", "vegetables", "vegetables"),
  group = c("grapefruit", "lemon", "grapefruit", "clementine", "orange", "potato", "potato", "onion", "onion", "onion", "onion", "fruit_other", "carrot", "carrot", "cucurbita", "cucurbita", "lettuce", "aubergine"),
  title = c("grapefruit", "lemon", "oroblanco", "clementine", "orange", "sweetpotato", "sweetpotato", "onion", "onion", "onion", "onion", "psidium", "carrot", "carrot", "butternut", "pumpkin", "lettuce", "aubergine"),
  quality = c("normal", "normal", "normal", "normal", "normal", "normal", "premium", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal", "normal")
)

AlexisW · December 6, 2023, 4:49pm

I'm sure more powerful text-processing functions exist somewhere; here I will only discuss two "easy" solutions.

Dataset

First, to make sure I understand what you have: on the one hand, the data in Hebrew:

data <- data.frame(
  date = as.Date(c("2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01", "2015-11-01")),
  id = c(81, 83, 86, 90, 95, 13, 14, 16, 18, 19, 22, 139, 24, 25, 26, 27, 29, 30),
  sector_heb = c("הדרים", "הדרים", "הדרים", "הדרים", "הדרים", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות", "ירקות"),
  title_heb = c("אשכולית", "לימון", "פומלית", "קליפים (קלמנטינה)", "תפוז", "בטטה", "בטטה", "בצל", "בצל", "בצל", "בצל", "גויבות", "גזר", "גזר", "דלורית", "דלעת", "חסה", "חציל"),
  subtitle_heb = c("אדומות", NA, NA, "מיכל", "ולנסיה", NA, NA, "אדום", "בית-אלפא", "יבש", NA, NA, NA, NA, NA, NA, "ערבית", "חממה"),
  spec_heb = c(NA, "קוטר 7", NA, NA, NA, NA, NA, NA, "קוטר 45-77", "יבוא", NA, NA, "ארוז", "תפזורת", NA, "10-25 ק\"ג", "יח' 1", NA),
  quality_heb = c("סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "מובחר", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א", "סוג א"),
  price = c(3.4, 5.1, 3.0, 3.5, 4.9, 2.7, 5.1, 3.0, 2.9, 2.3, 2.5, 7.0, 4.5, 4.0, 2.5, 2.5, 4.0, 3.0))

and on the other, a dictionary; here I'm assuming a simple data frame:

library(dplyr)    # left_join(), rename(), mutate()
library(tibble)   # tribble()
library(stringr)  # str_replace_all()

# One row per Hebrew term and its English translation
dict <- tribble(
  ~hebrew, ~english,
  "הדרים", "citrus",
  "אשכולית", "grapefruit",
  "סוג א", "normal",
  "לימון", "lemon",
  "פומלית", "oroblanco",
  "קליפים (קלמנטינה)", "clementine",
  "תפוז", "orange",
  "ירקות", "vegetables",
  "בטטה", "sweetpotato",
  "מובחר", "premium",
  "בצל", "onion",
  "גויבות", "psidium",
  "גזר", "carrot",
  "דלורית", "butternut",
  "דלעת", "pumpkin",
  "חסה", "lettuce",
  "חציל", "aubergine"
)

Translate whole words

It's pretty easy if you just replace whole "cells": if the Hebrew cell says "קליפים (קלמנטינה)", you replace it with "clementine". This is what a join does, for example:

data |>
  left_join(dict |> rename(sector = english),
            by = c(sector_heb = "hebrew"))

or, to make things more convenient, you can put the lookup in a function:

translate <- function(x, dict){
  dict$english[match(x, dict$hebrew)]
}

data |>
  mutate(sector = translate(sector_heb, dict),
         title = translate(title_heb, dict))

That way, you can create columns in English, as long as each Hebrew entry has an exact match in the dictionary; entries without a match come back as NA.
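Since the lookup relies on exact matches, it can help to list which Hebrew entries the dictionary does not cover yet. Here is a minimal base-R sketch of that idea; the helper name untranslated(), the tiny stand-in data and dictionary, and the choice of columns are all assumptions for illustration, not part of the original data.

```r
# Whole-cell lookup, same logic as translate() above; repeated so this runs alone
translate <- function(x, dict) {
  dict$english[match(x, dict$hebrew)]
}

# Hypothetical helper: distinct non-missing values with no dictionary entry
untranslated <- function(x, dict) {
  setdiff(unique(x[!is.na(x)]), dict$hebrew)
}

# Small stand-ins for the data and dictionary in this thread
dict <- data.frame(
  hebrew = c("הדרים", "ירקות", "לימון", "בצל"),
  english = c("citrus", "vegetables", "lemon", "onion")
)
data <- data.frame(
  sector_heb = c("הדרים", "ירקות", "ירקות"),
  title_heb = c("לימון", "בצל", "גזר")
)

# Translate each listed column; the new column name drops the "_heb" suffix
heb_cols <- c("sector_heb", "title_heb")
for (col in heb_cols) {
  data[[sub("_heb$", "", col)]] <- translate(data[[col]], dict)
}

# Per column, the Hebrew entries that are still missing from the dictionary
gaps <- lapply(data[heb_cols], untranslated, dict = dict)
gaps
```

Anything listed in gaps either needs a new dictionary row or is a cell (like those in spec_heb) that is better handled word by word.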

Partial translations

Things get messier for spec_heb: I'm assuming that for "קוטר 7" you'd want to extract 7 and קוטר and translate them separately. One solution would be to just replace individual words:

minidict <- c("קוטר" = "diameter")

str_replace_all(data$spec_heb, minidict)
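One thing to watch with a named vector in str_replace_all() is that the names are treated as regular expressions, so a key containing parentheses (such as "קליפים (קלמנטינה)") would not match literally. A possible alternative, sketched here in base R, is to replace each key as a fixed string. The function name translate_words() is made up for this example, and every English gloss other than "diameter" is my own assumption rather than something from the original dictionary.

```r
# Hypothetical word-level dictionary; only "diameter" comes from the thread,
# the other glosses are assumptions
minidict <- c(
  "קוטר" = "diameter",
  "יבוא" = "imported",
  "ארוז" = "packed",
  "תפזורת" = "loose"
)

translate_words <- function(x, dict) {
  # Replace longer keys first so a short key cannot alter part of a longer one
  keys <- names(dict)[order(nchar(names(dict)), decreasing = TRUE)]
  for (key in keys) {
    # fixed = TRUE matches the key literally, not as a regular expression
    x <- gsub(key, dict[[key]], x, fixed = TRUE)
  }
  x
}

spec_heb <- c(NA, "קוטר 7", "קוטר 45-77", "יבוא", "ארוז", "תפזורת")
spec <- translate_words(spec_heb, minidict)
spec
```

Missing values stay NA, and numbers such as 7 or 45-77 are left untouched, so they can be extracted afterwards if needed.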

Perfect translations

For your particular case, that looks like it could be enough; you just have to make sure you use the right function for each column (and look at the results to make sure you're not forgetting special cases).

If things get more complicated, I suspect you'd end up having to handle a lot of special cases (because human languages are complicated). In that situation, it might be a good case for deep learning models; maybe you could use a Google Cloud or Microsoft Azure translation API. I don't have direct experience with those.

rotem · December 7, 2023, 3:32am

Thanks so much! Such a simple yet effective idea.
