Hi, and welcome!
A couple of preliminary things. Most of what we do here is help solve usage errors illustrated in a reproducible example, called a reprex. There's also a homework policy, if that applies.
@mattwrketin sketched an approach. I'd like to add some foundational comments on two important parts of your question that most beginners in classical statistics struggle with: "proof" and "significance."
Classical statistics doesn't prove hypotheses; it either rejects or fails to reject them. In your example, it may be possible to say that a difference exists and that the difference is, to a greater or lesser degree of confidence, unlikely to be due simply to chance.
In formal terms, the null hypothesis, conventionally designated H_0, is that some statistical measure, or parameter, like a mean, is observed simply by chance. The alternative hypothesis, H_1, is that the test statistic is not simply random. When we reject H_0, we accept H_1; when we fail to reject H_0, we haven't shown H_0 to be true, only that the data don't give enough evidence for H_1.
The distinction takes on greater urgency when you consider confirmation bias, the human tendency to look for reasons to prove what we want to believe.
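The reject / fail-to-reject decision described above can be sketched in R. This is only an illustration under stated assumptions: the data are simulated, the names `before` and `after` are hypothetical, and the 0.05 cutoff is just the conventional choice.

```r
# Hypothetical data: two simulated samples drawn from the SAME distribution,
# so H_0 (equal means) is true by construction.
set.seed(42)
before <- rnorm(30, mean = 100, sd = 10)
after  <- rnorm(30, mean = 100, sd = 10)

# Welch two-sample t-test; H_0: the two population means are equal.
result <- t.test(before, after)

# Assumed cutoff; the test never "proves" either hypothesis.
alpha <- 0.05
decision <- if (result$p.value < alpha) "reject H_0" else "fail to reject H_0"
decision
```

Note that the only two possible outcomes are "reject" and "fail to reject"; there is no branch that returns "proved".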
The second foundational point is the importance of understanding the word significance in statistics.
"You keep using that word. I do not think it means what you think it means." (Iñigo Montoya)
Here are a few things it doesn't mean:
- meaningful
- important
- proven
Statistical significance is a measure of how likely it would be to see a test statistic at least as extreme as the one observed if nothing but random variation were at work, that is, if H_0 were true.
You often hear of some result being described as having a 95% probability or confidence. What this comes down to is that, if only random noise were at work, a result this extreme would turn up about one time in 20. Technically, the measure is called a p-value, and it is compared against a significance level called \alpha; for the 95% example \alpha = 0.05, so the confidence level is 1 - \alpha = 0.95.
I call this value of \alpha passing the laugh test. Think of four five-shot revolvers lying on the table, with a single bullet in just one of the 20 chambers. Pick one up, hold it to your head and pull the trigger for $1 million? Not a great bet.
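To make the "one chance in 20" idea concrete, here is a small simulation sketch. It assumes normally distributed data and a two-sample t-test, neither of which comes from your question; the point is only that when H_0 is true, roughly a fraction \alpha of tests still come out "significant".

```r
set.seed(1)
alpha  <- 0.05   # conventional significance level
n_sims <- 2000   # number of simulated experiments (arbitrary choice)

# In every experiment both samples come from the same distribution,
# so any "significant" result is a false alarm.
p_values <- replicate(n_sims, t.test(rnorm(30), rnorm(30))$p.value)

# Share of experiments that would be declared significant at alpha.
false_positive_rate <- mean(p_values < alpha)
false_positive_rate   # expected to land near alpha

# The revolver arithmetic: one bullet in 4 revolvers x 5 chambers.
revolver_odds <- 1 / (4 * 5)
```

The simulated rate will wobble a little from run to run, but it should stay close to `alpha`, which is the same 1-in-20 as `revolver_odds`.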
On to the choice of a statistical design for your problem. Let's focus on one country at a time, say Japan.
We have two variables, the price of oil and consumption.
The price of oil is usually quoted per "barrel" of 42 US gallons, or just short of 159 liters. But there are different types of crude oil and therefore different benchmarks. You will want to use one that is appropriate to Japan, considering the markets in which it buys crude.
Crude oil is not directly useful for much beyond refining into petrochemical products, so you need to consider what measure of consumption is relevant. As the industrial input? Or as an ultimate consumer product, such as gasoline or heating oil?
Once you've picked the measures and units for price and consumption, it's time to assign the two variables their roles as dependent and independent variables. Conventionally, Y is used for the dependent variable and X for the independent variable.
So, how are Y and X associated?
A very useful rule of thumb is to start off with the assumption that the association is linear, because so many things are. To restrict @mattwrketin's model to just Japan:
fit <- lm(Y ~ X, data = prices)
Why just Japan? Because if there's no association between price and consumption in Japan, there's no reason to look into differences with Saudi Arabia.
Linear regression is easy to do, but its results can be more difficult to interpret. See my post for an orientation.
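As a rough sketch of where to look in the output, here is the same `lm()` call run on made-up data. Everything in it is an assumption for illustration: the numbers are simulated rather than real Japanese figures, the units are arbitrary, and the negative relationship is built in by hand.

```r
# Hypothetical data: simulated price (X) and consumption (Y), NOT real figures.
set.seed(123)
n <- 60
X <- runif(n, min = 40, max = 120)        # price, e.g. USD per barrel
Y <- 500 - 1.5 * X + rnorm(n, sd = 20)    # consumption falls as price rises
prices <- data.frame(X = X, Y = Y)

fit <- lm(Y ~ X, data = prices)

# Pull out the pieces most people read first.
coefs   <- summary(fit)$coefficients
slope   <- coefs["X", "Estimate"]    # change in Y per one-unit change in X
p_slope <- coefs["X", "Pr(>|t|)"]    # p-value for H_0: slope is zero
r2      <- summary(fit)$r.squared    # share of variance in Y explained by X
```

A small `p_slope` says only that a slope of zero is hard to square with the data; whether the size of `slope` is meaningful or important is a separate judgment.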