Hey, I'm a beginner with regular expressions (regex) and I cannot figure out how to solve this problem:
I have a database with some analytics from a bunch of webpages and I want to extract all the pages that are blogs.
The structure of the URLs is roughly like this:
/main/index/pending
I would appreciate any help.
If I gave you several URLs, would you, as a person, have a method to tell which are blogs and which aren't?
If your method involves reading the URL, or the HTML at that URL, and looking for particular strings of text that signify whether it is a blog, then it will be possible to write a regex; otherwise it won't be.
Do you have a heuristic for blog detection that you wish to implement?
I'm not sure I understood your question clearly, but the URLs that come from a blog do have the word "blog" somewhere in the URL, for example:
/profiles/blogs/job-titles-for-data-scientist
Cool, that seems like a simple enough rule.
So from your database, do you get some object that contains the URLs you want to check?
Perhaps a data frame or a vector?
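As a quick sketch of that rule, assuming the URLs end up in a character vector: the name `urls` is made up for illustration, and the two paths are the examples already posted in this thread. It only shows the idea of testing each URL for the word "blog".

```r
# Hypothetical vector holding the two example paths from this thread
urls <- c("/main/index/pending",
          "/profiles/blogs/job-titles-for-data-scientist")

# grepl() returns TRUE for every element that matches the pattern
is_blog <- grepl("blog", urls)

# Keep only the elements flagged as blogs
urls[is_blog]
```

`grepl()` gives back a logical vector of the same length as its input, so it can be used directly to subset a vector or to build a new column.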
If your object containing the URLs is called myobj, you can share a small sample with us by doing something like
dput(head(myobj, n = 10))
which would give the first 10 of whatever you have, in a form that you can copy and paste.
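To illustrate why `dput()` is handy for sharing data, here is a minimal sketch. The data frame below is a made-up stand-in (its paths and numbers are not from the real database). It shows that the text `dput()` prints is itself R code that rebuilds the object.

```r
# Hypothetical stand-in for the real object
myobj <- data.frame(
  Page = c("/", "/profiles/blogs/some-post", "/group/some-group"),
  Pageviews = c(100, 50, 25),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object; head() limits it to the first rows
dput(head(myobj, n = 2))

# Normally someone else would copy and paste that printed code.
# Here the same round trip is done in code: capture the text, then evaluate it.
sample_code <- paste(capture.output(dput(head(myobj, n = 2))), collapse = "\n")
rebuilt <- eval(parse(text = sample_code))
rebuilt
```

Anyone reading the post can assign the pasted `structure(...)` call to a name and work with the same data.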
I got this. Is it readable?
structure(list(Page = c("/", "/profiles/blogs/check-out-our-dsc-newsletter",
"/profiles/blogs/one-page-r-a-survival-guide-to-data-science-with-r",
"/group/data-science-apprenticeship/forum/topics/update-about-our-data-science-apprenticeship",
"/group/data-science-apprenticeship", "/profiles/blogs/six-categories-of-data-scientists",
"/group/data-science-certification", "/profiles/blogs/66-job-interview-questions-for-data-scientists",
"/profiles/blogs/my-data-science-book", "/profiles/blogs/17-short-tutorials-all-data-scientists-should-read-and-practice"
), Pageviews = c(73938, 18430, 12467, 11594, 11253, 9533, 9429,
8490, 8225, 8099), Unique Pageviews
= c(53297, 15636, 6964,
7621, 7964, 7497, 8059, 7970, 7125, 7283), Avg. Time on Page
= c(105.543782563807,
89.6291752702539, 155.347740667976, 105.651444662096, 79.8064188043847,
130.965612648221, 104.98207605985, 309.635729613734, 148.032440056417,
131.7038253931), Entrances = c(39150, 13933, 5075, 1876, 1950,
4806, 5443, 6680, 1915, 5315), Bounce Rate
= c(0.529553001277139,
0.63625924065169, 0.649852216748768, 0.502665245202559, 0.377948717948718,
0.563462338743238, 0.394084144773103, 0.851197604790419, 0.540992167101828,
0.547695202257761), percentage_exit = c(0.436703724742352, 0.553282691264243,
0.428411005053341, 0.295497671209246, 0.213631920376788, 0.46921221021714,
0.319546081238732, 0.780447585394582, 0.396595744680851, 0.473885664896901
), Page Value
= c(61203.0491763369, 20157.1122626153, 7975.81414935429,
3197.64636881145, 3483.41775526526, 7356.96611769642, 9146.71067981758,
8146.61012956419, 3070.51914893617, 8111.29769107297), percentage_of_total_visits = c(0.091818791446241,
0.0228870178575864, 0.0154819561383901, 0.0143978342398728, 0.0139743685269354,
0.0118384124382187, 0.0117092616049475, 0.0105431786007004, 0.0102140923428458,
0.0100576211409979), percentage_cumulated = c(0.091818791446241,
0.114705809303827, 0.130187765442217, 0.14458559968209, 0.158559968209026,
0.170398380647244, 0.182107642252192, 0.192650820852892, 0.202864913195738,
0.212922534336736)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Oh, the backticks got replaced.
In the future, when you paste code like this, put a line with three backticks and an r before it and a line with three backticks after it.
This keeps the forum from formatting away the backticks in your code.
Here is your data with the backticks restored:
library(tidyverse)
example <- structure(list(Page = c("/", "/profiles/blogs/check-out-our-dsc-newsletter",
"/profiles/blogs/one-page-r-a-survival-guide-to-data-science-with-r",
"/group/data-science-apprenticeship/forum/topics/update-about-our-data-science-apprenticeship",
"/group/data-science-apprenticeship", "/profiles/blogs/six-categories-of-data-scientists",
"/group/data-science-certification", "/profiles/blogs/66-job-interview-questions-for-data-scientists",
"/profiles/blogs/my-data-science-book", "/profiles/blogs/17-short-tutorials-all-data-scientists-should-read-and-practice"
), Pageviews = c(73938, 18430, 12467, 11594, 11253, 9533, 9429, 8490, 8225, 8099),
`Unique Pageviews` = c(53297, 15636, 6964, 7621, 7964, 7497, 8059, 7970, 7125, 7283),
`Avg. Time on Page` = c(105.543782563807,
89.6291752702539, 155.347740667976, 105.651444662096, 79.8064188043847,
130.965612648221, 104.98207605985, 309.635729613734, 148.032440056417,
131.7038253931), Entrances = c(39150, 13933, 5075, 1876, 1950,
4806, 5443, 6680, 1915, 5315), `Bounce Rate` = c(0.529553001277139,
0.63625924065169, 0.649852216748768, 0.502665245202559, 0.377948717948718,
0.563462338743238, 0.394084144773103, 0.851197604790419, 0.540992167101828,
0.547695202257761), `percentage_exit` = c(0.436703724742352, 0.553282691264243,
0.428411005053341, 0.295497671209246, 0.213631920376788, 0.46921221021714,
0.319546081238732, 0.780447585394582, 0.396595744680851, 0.473885664896901
), `Page Value` = c(61203.0491763369, 20157.1122626153, 7975.81414935429,
3197.64636881145, 3483.41775526526, 7356.96611769642, 9146.71067981758,
8146.61012956419, 3070.51914893617, 8111.29769107297), percentage_of_total_visits = c(0.091818791446241,
0.0228870178575864, 0.0154819561383901, 0.0143978342398728, 0.0139743685269354,
0.0118384124382187, 0.0117092616049475, 0.0105431786007004, 0.0102140923428458,
0.0100576211409979), percentage_cumulated = c(0.091818791446241,
0.114705809303827, 0.130187765442217, 0.14458559968209, 0.158559968209026,
0.170398380647244, 0.182107642252192, 0.192650820852892, 0.202864913195738,
0.212922534336736)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
example$Page
> example$Page
[1] "/"
[2] "/profiles/blogs/check-out-our-dsc-newsletter"
[3] "/profiles/blogs/one-page-r-a-survival-guide-to-data-science-with-r"
[4] "/group/data-science-apprenticeship/forum/topics/update-about-our-data-science-apprenticeship"
[5] "/group/data-science-apprenticeship"
[6] "/profiles/blogs/six-categories-of-data-scientists"
[7] "/group/data-science-certification"
[8] "/profiles/blogs/66-job-interview-questions-for-data-scientists"
[9] "/profiles/blogs/my-data-science-book"
[10] "/profiles/blogs/17-short-tutorials-all-data-scientists-should-read-and-practice"
example$blog_urls <- grepl(pattern = "blogs",example$Page)
frame_keep_only_blogs <- dplyr::filter(example, blog_urls)
> frame_keep_only_blogs
# A tibble: 6 x 11
Page Pageviews `Unique Pagevie~ `Avg. Time on P~ Entrances `Bounce Rate` percentage_exit `Page Value` percentage_of_t~
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 /pro~ 18430 15636 89.6 13933 0.636 0.553 20157. 0.0229
2 /pro~ 12467 6964 155. 5075 0.650 0.428 7976. 0.0155
3 /pro~ 9533 7497 131. 4806 0.563 0.469 7357. 0.0118
4 /pro~ 8490 7970 310. 6680 0.851 0.780 8147. 0.0105
5 /pro~ 8225 7125 148. 1915 0.541 0.397 3071. 0.0102
6 /pro~ 8099 7283 132. 5315 0.548 0.474 8111. 0.0101
# ... with 2 more variables: percentage_cumulated <dbl>, blog_urls <lgl>
In this way we've gone from a 10-row frame to a 6-row frame.
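One possible refinement, sketched under an assumption: the pattern "blogs" matches anywhere in the path, so it would also keep a page that merely mentions blogs. If all blog posts really live under /profiles/blogs/ (true for the sample above, but not confirmed for the full database), the pattern can be anchored to the start of the path. The vector `pages` and its last entry are made up for illustration.

```r
# Small stand-in for the Page column; the last path is invented to show the difference
pages <- c("/",
           "/profiles/blogs/my-data-science-book",
           "/group/data-science-certification",
           "/group/blogs-discussion")

# Loose rule: "blogs" anywhere in the path
loose <- grepl("blogs", pages)

# Stricter rule: the path must start with /profiles/blogs/
strict <- grepl("^/profiles/blogs/", pages)

pages[loose]   # keeps the invented /group/blogs-discussion page as well
pages[strict]  # keeps only paths under /profiles/blogs/
```

Which rule is right depends on how the site names its pages, so it is worth checking a few of the matched rows by eye.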
That is great, thank you so much for the help