
Importing typically means that you take data stored in a file, database, or web application programming interface (API), and load it into a data frame in R. If you can’t get your data into R, you can’t do data science on it!

Once you’ve imported your data, it is a good idea to tidy it. Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, rather than on fighting to get the data into the right form for different functions.

Once you have tidy data, a common first step is to transform it. Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means). Together, tidying and transforming are called wrangling, because getting your data in a form that’s natural to work with often feels like a fight!

Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualisation and modelling. These have complementary strengths and weaknesses, so any real analysis will iterate between them many times.

Visualisation is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions about the data. A good visualisation might also hint that you’re asking the wrong question, or that you need to collect different data. Visualisations can surprise you, but don’t scale particularly well because they require a human to interpret them.

Models are complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature a model cannot question its own assumptions.
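The import, tidy, transform, visualise, and model steps described in this section can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data are invented, the column names (`runner`, `city`, `km`, `hours`) are hypothetical, and inline CSV text stands in for a real file, database, or web API. Base R is used here so the sketch runs without extra packages; it is one possible way to express each step, not the only one.

```r
# Import: load data into a data frame. The inline text stands in for a
# file on disk (hypothetical, made-up data).
raw <- read.csv(text = "runner,city,km_2023,km_2024,hours_2023,hours_2024
Ana,Leeds,10,12,1.0,1.0
Ben,Leeds,8,9,1.0,0.75
Cat,York,5,7,0.5,0.5")

# Tidy: the year is a variable hidden in the column names. Reshape so that
# each row is one runner-year observation and each column is one variable.
tidy <- reshape(
  raw,
  direction = "long",
  varying   = list(c("km_2023", "km_2024"), c("hours_2023", "hours_2024")),
  v.names   = c("km", "hours"),
  timevar   = "year",
  times     = c(2023, 2024),
  idvar     = "runner"
)
rownames(tidy) <- NULL

# Transform: narrow in on observations of interest (one year) ...
recent <- tidy[tidy$year == 2024, ]
# ... create a new variable from existing ones (speed from distance and time) ...
recent$speed_kmh <- recent$km / recent$hours
# ... and calculate a summary statistic (mean speed per city).
mean_speed <- aggregate(speed_kmh ~ city, data = recent, FUN = mean)

# Visualise: a scatterplot for a human to inspect.
plot(km ~ hours, data = tidy)

# Model: a linear model, which assumes distance is linear in time.
fit <- lm(km ~ hours, data = tidy)

print(tidy)
print(mean_speed)
```

Printing `tidy` makes it easy to check the tidy-data rule directly: each column holds one variable and each row holds one observation. The linear model at the end carries its own assumption (a straight-line relationship), which the model itself cannot check; the plot is where a reader would notice if that assumption looked wrong.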
