Recently I came across a dataset related to a personal care products store. The store sold hand sanitisers, hand wash, face masks and a whole lot of other personal care products. My goal was to retrieve the data from their POS (Point of Sales) system and clean the dataset so that it was ready for further analysis.
The sales Dataset was downloaded as a .csv file from the POS system. This file contained 14 columns. The 12th column contained the price of each order. But the column name in the .csv file was not quite explanatory. Hence it was renamed “price”
colnames(products)[12] <- "Price"
Before moving any further, I must tell you that I will be using the libraries under the “tidyverse” package. Tidyverse is a powerful package that comes with so many useful libraries. Especially it comes in handy when you need to do Data Science related tasks in R. Simply put, Tidyverse is the Data Science workflow in R.
Extracting Specific Columns
As I told you earlier there were 14 columns in the dataset. But for the analysis, we needed only 5 of them. Therefore, a new data frame (a tibble) was created by extracting only the 5 columns.
prod_cleaned <- as.tibble(products[,c("Date","Product", "SKU", "Quantity", "Price")])
Filtering Data by a String
Since this shop sold so many products, it was essential to extract only the data related to sanitisers. For this, build in function “grepl” came in handy.
sanitizers <- prod_cleaned %>% filter(grepl('Sanitizer', Product))
trimming unwanted characters
The next challenge was to remove the unwanted characters in the “quantity” column. Since this sales data was related to several products and each product had different units, the “quantity” column was crowded with lots of different characters like “Pc(s)”, “Nos” and so on. Therefore I had to trim the unwanted characters and spare only the numerical part.
sanitizers$Quantity <- str_trim(str_replace(sanitizers$Quantity, "No\\(s\\)", ""),"both")
Conversion from “string” to “numeric”
The original type of the “Quantity” column was string. This is problematic for future analysis. Hence, it was converted to numeric.
sanitizers$Quantity <- as.numeric(sanitizers$Quantity)
Conversion from “String” to “Date”
Next up was the “Date” column. Similar to the “Quantity” column, this was also in the form of String. This should be converted into the proper “Date” format if we were to do further analysis. The “Lubridate” package that comes within “tidyverse” came in handy here.
library(lubridate)
sanitizers$Date <- dmy(sanitizers$Date)
sanitizers %>% arrange(ymd(sanitizers$Date))
Finally, our dataset is now all cleaned up and ready for analysis!