I need help with Regex

hobaez · February 17, 2020, 4:38pm

Hey, I'm a beginner with regular expressions (regex) and I cannot figure out how to solve this problem:
I have a database with some analytics from a bunch of webpages and I want to extract all the pages that are blogs.
The structure of the URLs is something like this:
/main/index/pending
I'd appreciate it if you could help me.

nirgrahamuk · February 17, 2020, 4:47pm

If I gave you several URLs, do you have a method, you as a person, to tell which are blogs and which aren't?
If your method involves reading the URL, or the HTML at the URL, and looking for particular strings of text that signify whether it's a blog or not, then it will be possible to write a regex. Otherwise it's impossible.

Do you have a heuristic for blog detection that you wish to implement?

hobaez · February 17, 2020, 4:51pm

I'm not sure I understood your question clearly, but the URLs that come from a blog do have the word "blog" somewhere in the URL, for example:
/profiles/blogs/job-titles-for-data-scientist

nirgrahamuk · February 17, 2020, 4:54pm

Cool, that seems a simple enough rule.
So from your database, do you get some object that contains the URLs that you want to check through?
Perhaps a data frame or a vector?
If your object containing the URLs is called myobj, you can share a small sample with us by doing something like

dput(head(myobj, n = 10))
which would give the first 10 elements (or rows) of whatever you have, in a form you can copy and paste.
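
As a rough sketch of what that does, assuming a small made-up data frame (the name myobj and all the values below are placeholders, not your real data): dput() prints R code that rebuilds the object, so anyone can paste it into their own session.

```r
# Hypothetical stand-in for the object that holds your URLs
myobj <- data.frame(
  Page = c("/",
           "/profiles/blogs/job-titles-for-data-scientist",
           "/main/index/pending"),
  Pageviews = c(100, 80, 60),          # made-up numbers
  stringsAsFactors = FALSE
)

# Print R code that recreates the first two rows
dput(head(myobj, n = 2))

# Round trip: capture that printed code and evaluate it to rebuild the sample
rebuilt <- eval(parse(text = capture.output(dput(head(myobj, n = 2)))))

# Compare the rebuilt sample with the original rows
all.equal(rebuilt, head(myobj, n = 2))
```

Pasting the printed structure(...) text into a reply is enough for others to recreate the sample on their side.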

hobaez · February 17, 2020, 4:57pm

I got this. Is it readable?

structure(list(Page = c("/", "/profiles/blogs/check-out-our-dsc-newsletter",
"/profiles/blogs/one-page-r-a-survival-guide-to-data-science-with-r",
"/group/data-science-apprenticeship/forum/topics/update-about-our-data-science-apprenticeship",
"/group/data-science-apprenticeship", "/profiles/blogs/six-categories-of-data-scientists",
"/group/data-science-certification", "/profiles/blogs/66-job-interview-questions-for-data-scientists",
"/profiles/blogs/my-data-science-book", "/profiles/blogs/17-short-tutorials-all-data-scientists-should-read-and-practice"
), Pageviews = c(73938, 18430, 12467, 11594, 11253, 9533, 9429,
8490, 8225, 8099), Unique Pageviews = c(53297, 15636, 6964,
7621, 7964, 7497, 8059, 7970, 7125, 7283), Avg. Time on Page = c(105.543782563807,
89.6291752702539, 155.347740667976, 105.651444662096, 79.8064188043847,
130.965612648221, 104.98207605985, 309.635729613734, 148.032440056417,
131.7038253931), Entrances = c(39150, 13933, 5075, 1876, 1950,
4806, 5443, 6680, 1915, 5315), Bounce Rate = c(0.529553001277139,
0.63625924065169, 0.649852216748768, 0.502665245202559, 0.377948717948718,
0.563462338743238, 0.394084144773103, 0.851197604790419, 0.540992167101828,
0.547695202257761), percentage_exit = c(0.436703724742352, 0.553282691264243,
0.428411005053341, 0.295497671209246, 0.213631920376788, 0.46921221021714,
0.319546081238732, 0.780447585394582, 0.396595744680851, 0.473885664896901
), Page Value = c(61203.0491763369, 20157.1122626153, 7975.81414935429,
3197.64636881145, 3483.41775526526, 7356.96611769642, 9146.71067981758,
8146.61012956419, 3070.51914893617, 8111.29769107297), percentage_of_total_visits = c(0.091818791446241,
0.0228870178575864, 0.0154819561383901, 0.0143978342398728, 0.0139743685269354,
0.0118384124382187, 0.0117092616049475, 0.0105431786007004, 0.0102140923428458,
0.0100576211409979), percentage_cumulated = c(0.091818791446241,
0.114705809303827, 0.130187765442217, 0.14458559968209, 0.158559968209026,
0.170398380647244, 0.182107642252192, 0.192650820852892, 0.202864913195738,
0.212922534336736)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))

nirgrahamuk · February 17, 2020, 5:05pm

Oh, the backticks got replaced.
In the future, when you paste code like this, please put a line with three backticks followed by an r before it, and a line with just three backticks after it.

This keeps your code intact, so the forum formatting doesn't strip the backticks in it.

library(tidyverse)
example <- structure(list(Page = c("/", "/profiles/blogs/check-out-our-dsc-newsletter",
 "/profiles/blogs/one-page-r-a-survival-guide-to-data-science-with-r",
 "/group/data-science-apprenticeship/forum/topics/update-about-our-data-science-apprenticeship",
 "/group/data-science-apprenticeship", "/profiles/blogs/six-categories-of-data-scientists",
 "/group/data-science-certification", "/profiles/blogs/66-job-interview-questions-for-data-scientists",
 "/profiles/blogs/my-data-science-book", "/profiles/blogs/17-short-tutorials-all-data-scientists-should-read-and-practice"
), Pageviews = c(73938, 18430, 12467, 11594, 11253, 9533, 9429, 8490, 8225, 8099),
`Unique Pageviews` = c(53297, 15636, 6964, 7621, 7964, 7497, 8059, 7970, 7125, 7283),
`Avg. Time on Page` = c(105.543782563807,
  89.6291752702539, 155.347740667976, 105.651444662096, 79.8064188043847,
  130.965612648221, 104.98207605985, 309.635729613734, 148.032440056417,
  131.7038253931), Entrances = c(39150, 13933, 5075, 1876, 1950,
 4806, 5443, 6680, 1915, 5315), `Bounce Rate` = c(0.529553001277139,
  0.63625924065169, 0.649852216748768, 0.502665245202559, 0.377948717948718,
  0.563462338743238, 0.394084144773103, 0.851197604790419, 0.540992167101828,
  0.547695202257761), `percentage_exit` = c(0.436703724742352, 0.553282691264243,
  0.428411005053341, 0.295497671209246, 0.213631920376788, 0.46921221021714,
  0.319546081238732, 0.780447585394582, 0.396595744680851, 0.473885664896901
  ), `Page Value` = c(61203.0491763369, 20157.1122626153, 7975.81414935429,
  3197.64636881145, 3483.41775526526, 7356.96611769642, 9146.71067981758,
  8146.61012956419, 3070.51914893617, 8111.29769107297), percentage_of_total_visits = c(0.091818791446241,
  0.0228870178575864, 0.0154819561383901, 0.0143978342398728, 0.0139743685269354,
  0.0118384124382187, 0.0117092616049475, 0.0105431786007004, 0.0102140923428458,
  0.0100576211409979), percentage_cumulated = c(0.091818791446241,
  0.114705809303827, 0.130187765442217, 0.14458559968209, 0.158559968209026,
  0.170398380647244, 0.182107642252192, 0.192650820852892, 0.202864913195738,
  0.212922534336736)), row.names = c(NA, -10L), class = c("tbl_df",
  "tbl", "data.frame"))

example$Page
> example$Page
[1] "/"                                                                                           
[2] "/profiles/blogs/check-out-our-dsc-newsletter"                                                
[3] "/profiles/blogs/one-page-r-a-survival-guide-to-data-science-with-r"                          
[4] "/group/data-science-apprenticeship/forum/topics/update-about-our-data-science-apprenticeship"
[5] "/group/data-science-apprenticeship"                                                          
[6] "/profiles/blogs/six-categories-of-data-scientists"                                           
[7] "/group/data-science-certification"                                                           
[8] "/profiles/blogs/66-job-interview-questions-for-data-scientists"                              
[9] "/profiles/blogs/my-data-science-book"                                                        
[10] "/profiles/blogs/17-short-tutorials-all-data-scientists-should-read-and-practice"   

# Flag each row whose Page contains the text "blogs"
example$blog_urls <- grepl(pattern = "blogs", x = example$Page)

# Keep only the flagged rows
frame_keep_only_blogs <- dplyr::filter(example, blog_urls)
> frame_keep_only_blogs
# A tibble: 6 x 11
Page  Pageviews `Unique Pagevie~ `Avg. Time on P~ Entrances `Bounce Rate` percentage_exit `Page Value` percentage_of_t~
<chr>     <dbl>            <dbl>            <dbl>     <dbl>         <dbl>           <dbl>        <dbl>            <dbl>
1 /pro~     18430            15636             89.6     13933         0.636           0.553       20157.           0.0229
2 /pro~     12467             6964            155.       5075         0.650           0.428        7976.           0.0155
3 /pro~      9533             7497            131.       4806         0.563           0.469        7357.           0.0118
4 /pro~      8490             7970            310.       6680         0.851           0.780        8147.           0.0105
5 /pro~      8225             7125            148.       1915         0.541           0.397        3071.           0.0102
6 /pro~      8099             7283            132.       5315         0.548           0.474        8111.           0.0101
# ... with 2 more variables: percentage_cumulated <dbl>, blog_urls <lgl>

In this way we've gone from a 10-row frame to a 6-row frame that contains only the blog URLs.
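
One caveat, offered as a hedged sketch rather than something your data necessarily needs: the pattern "blogs" matches that text anywhere in the URL and is case sensitive. If you ever need to be stricter, you could require blogs to be a whole path segment. The last two URLs below are hypothetical, made up only to show the difference.

```r
# Three URLs from the sample above, plus two hypothetical edge cases
pages <- c(
  "/",
  "/profiles/blogs/my-data-science-book",
  "/group/data-science-certification",
  "/profiles/Blogs/some-post",     # hypothetical: different capitalisation
  "/group/weblogs-and-feeds"       # hypothetical: "blogs" inside another word
)

# Plain substring match: TRUE wherever the text "blogs" appears
loose <- grepl("blogs", pages)

# Stricter match: "blogs" must be a whole path segment, ignoring case
strict <- grepl("/blogs(/|$)", pages, ignore.case = TRUE)

# Side-by-side comparison of the two rules
data.frame(pages, loose, strict)

# Keep only the URLs that pass the stricter rule
pages[strict]
```

For the sample you shared, both patterns pick the same six rows, so the simple version is fine there.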

hobaez · February 17, 2020, 5:16pm

That is great, thank you so much for the help
