## Importing packages
# This R environment comes with all of CRAN and many other helpful packages preinstalled.
# You can see which packages are installed by checking out the kaggle/rstats docker image:
# https://github.com/kaggle/docker-rstats
library(tidyverse) # metapackage with lots of helpful functions
## Running code
# In a notebook, you can run a single code cell by clicking in the cell and then hitting
# the blue arrow to the left, or by clicking in the cell and pressing Shift+Enter. In a script,
# you can run code by highlighting the code you want to run and then clicking the blue arrow
# at the bottom of this window.
## Reading in files
# You can access files from datasets you've added to this kernel in the "../input/" directory.
# You can see the files added to this kernel by running the code below.
list.files(path = "../input")
## Saving data
# If you save any files or images, these will be put in the "output" directory. You
# can see the output directory by committing and running your kernel (using the
# Commit & Run button) and then checking out the compiled version of your kernel.
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ── ✔ ggplot2 3.3.0.9000 ✔ purrr 0.3.3 ✔ tibble 2.1.3 ✔ dplyr 0.8.4 ✔ tidyr 1.0.2 ✔ stringr 1.4.0 ✔ readr 1.3.1 ✔ forcats 0.4.0 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag()
options(scipen = 15)
options(repr.plot.width=16, repr.plot.height=10)
options(warn=-1)
Working on the Craigslist Dataset uploaded on kaggle, the aim on the project is to predict the price based on factors decided after analysis. Decription of columns:-
Let us first explore the dataset
library('fastDummies')
library('caTools')
library('sf')
library('ggplot2')
library('rnaturalearth')
library('rnaturalearthdata')
library('maps')
library('mapdata')
library('FNN')
library('xgboost')
library('caret')
library('factoextra')
library('cluster')
df = as.data.frame(read_csv('/kaggle/input/craigslist-carstrucks-data/vehicles.csv'))
head(df,n = 3)
Parsed with column specification: cols( .default = col_character(), id = col_double(), price = col_double(), year = col_double(), odometer = col_double(), county = col_logical(), lat = col_double(), long = col_double() ) See spec(...) for full column specifications.
id | url | region | region_url | price | year | manufacturer | model | condition | cylinders | ⋯ | drive | size | type | paint_color | image_url | description | county | state | lat | long | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <chr> | <chr> | <chr> | <chr> | ⋯ | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <lgl> | <chr> | <dbl> | <dbl> | |
1 | 7088746062 | https://greensboro.craigslist.org/ctd/d/cary-2012-acura-tl-base-4dr-sedan/7088746062.html | greensboro | https://greensboro.craigslist.org | 10299 | 2012 | acura | tl | NA | NA | ⋯ | NA | NA | other | blue | https://images.craigslist.org/01414_3LIXs9EO33z_600x450.jpg | 2012 Acura TL Base 4dr Sedan Offered by: Best Import Auto Sales Inc — (919) 800-0650 — $10,299 PRISTINE CONDITION INSIDE AND OUT Best Import Auto Sales Inc Year: 2012 Make: Acura Model: TL Series: Base 4dr Sedan VIN: 19UUA8F22CA003926 Condition: Used Mileage: 90,186 Exterior: Blue Interior: Black Body: Sedan Transmission: Automatic 6-Speed Engine: 3.5L V6 **** Best Import Auto Sales Inc. 🚘 Raleigh Auto Dealer ***** ⚡️⚡️⚡️ Call Or Text (919) 800-0650 ⚡️⚡️⚡️ ✅ - We can arrange Financing Options with most banks and credit unions!!!! ✅ Extended Warranties Available on most vehicles!! "Call To Inquire" ✅ Full Service ASE-Certified Shop Onsite! More vehicle details: best-import-auto-sales-inc.hammerwebsites.net/v/cfoamRwq Address: 1501 Buck Jones Rd Raleigh, NC 27606 Phone: (919) 800-0650 Website: www.bestimportsonline.com 📲 ☎️ Call or text (919) 800-0650 for quick answers to your questions about this Acura TL Your message will always be answered by a real human — never an automated system. Disclaimer: Best Import Auto Sales Inc will never sell, share, or spam your mobile number. Standard text messaging rates may apply. 2012 Acura TL Base 4dr Sedan 30b9c4702111452eb57503c99e795660 | NA | nc | 35.7636 | -78.7443 |
2 | 7088745301 | https://greensboro.craigslist.org/ctd/d/bmw-3-series-335-convertible-harmon/7088745301.html | greensboro | https://greensboro.craigslist.org | 0 | 2011 | bmw | 335 | NA | 6 cylinders | ⋯ | rwd | NA | convertible | blue | https://images.craigslist.org/00S0S_1kTatLGLxB5_600x450.jpg | BMW 3 Series 335i Convertible Navigation Dakota Leather Heated Seats Automatic High Beam We Finance!Price: Call for PricingCall Today 888-728-7443You can Fill out a Free Super Quick Pre-Approval Credit Application here!For Instant Online Approvals! OPEN TODAY! 888-768-8164Copy The Link Belowhttps://www.smartchevrolet.com/finance/apply-for-financing/Finance Online, We Can Ship To Your Door!We Have First Time Buyers Program!We Use Over 40 banks a Credit Unions With Lowest Rates Possible For All Types Of Credit!FINANCING FOR ALL TYPES OF CREDIT! BAD CREDIT, NO CREDIT, Repossession NO PROBLEM!SLOW PAYMENTS, BANKRUPTCY, REPOS NO PROBLEM!COLLECTIONS, JUDGEMENTS, DIVORCE NO PROBLEM!TRADE-INS Great! WE BUY CARS Everyday, Even if you don't buy Ours!Se Habla EspanolAsk for: Craigslist Salesstore: (888) 728-7443VIN Number: WBADX7C51BE579063Engine: 3.0L 6-Cylinder DOHC Twin TurbochargedStock No: 579063KBWarranty: Original Manufacturer WarrantyMiles: 115120Interior: Oyster BlackTrans: 6-Speed Automatic SteptronicExterior: Blue Water MetallicOptions:Navigation SystemReal Time Traffic InformationCold Weather PackageConvenience PackagePremium PackageConvertible Hardtop8 SpeakersAM/FM CD/MP3 RadioAM/FM RadioCD PlayerIpod & USB AdapterMP3 DecoderRadio Data SystemSIRIUS Satellite RadioAir ConditioningAutomatic Temperature ControlFront Dual Zone A/CRear Air ConditioningRear Window DefrosterComfort Access Keyless EntryLumbar SupportMemory SeatPower Driver SeatPower SteeringPower WindowsRemote Keyless EntrySteering Wheel Mounted A/C ControlsSteering Wheel Mounted Audio ControlsVoice CommandActive SteeringFour Wheel Independent SuspensionSpeed-Sensing SteeringSport SuspensionTraction Control4-Wheel Disc BrakesABS BrakesAnti-Whiplash Front Head RestraintsDual Front Impact AirbagsDual Front Side Impact AirbagsFront Anti-Roll BarIntegrated Roll-Over ProtectionKnee AirbagLow Tire Pressure WarningOccupant Sensing AirbagRear Anti-Roll BarBMW Assist W/Bluetooth®Smartphone IntegrationBrake AssistElectronic Stability ControlAutomatic High BeamsDelay-Off HeadlightsFront Fog LightsFully Automatic HeadlightsHigh Intensity Discharge Headlights: Bi-XenonRetractable Headlight WashersAnti-Theft Alarm SystemActive Cruise ControlSpeed ControlAuto-Dimming MirrorsBumpers: Body-ColorHeated Door MirrorsPower Door MirrorsAuto-Dimming Rear-View MirrorBamboo Anthracite Wood TrimConvertible Roof LiningDigital Compass MirrorDriver Door BinDriver Vanity MirrorFront Reading LightsGenuine Wood Console InsertGenuine Wood Dashboard InsertGenuine Wood Door Panel InsertHeated Steering WheelIlluminated EntryLeather Shift KnobOutside Temperature DisplayPark Distance ControlPassenger Vanity MirrorPower Convertible RoofRear Reading LightsRear Seat Center ArmrestSmartphone Integration (DISC)Sport Steering WheelTachometerTelescoping Steering WheelTilt Steering WheelTrip ComputerUniversal Garage-Door OpenerDakota Leather UpholsteryFront Bucket SeatsFront Center ArmrestHeated Front SeatsPower Passenger SeatThrough-Loading System W/Integrated Transport BagPassenger Door Bin17" Light Alloy Star-Spoke (Style 339) WheelsGlass Rear WindowRain Sensing WipersVariably Intermittent WipersCARFAX CERTIFIEDDescription:BMW 3 Series 335i Convertible, Navigation System, Dakota Leather, Heated Seats, Automatic High Beams, Carfax Cerified, Premium Package, Steptronic Auto Trans. We Finance! $11040 In Installed Options on the car!INSTALLED OPTIONS[205] Steptronic Automatic Trans.normalsport & manual shift modes$1,375[896] Blue Water Metallic $550[LCCX] Oyster/Black Dakota Leather $0[ZPP] Premium PackageUniversal garage-door openerAuto-dimming mirrorsAuto-dimming rearview mirrorLumbar supportInterior mirror with compassBMW Assist with Bluetooth$1,650[ZCV] Convenience PackageAlarm SystemComfort Access keyless entryPark Distance Control$1,250[217] Active Steering $1,550[5AC] Automatic High Beams $250[655] Sirius XM Radio W/ 1 Year Sub.(1) year subscription$350[6FL] I Pod And Usb Adapter $400[494] Heated Front Seats $500[248] Heated Steering Wheel $190[609] Navigation System16:9 high-resolution display3-D screenvoice command systemreal time traffic infoiDrive system w/on-board computer6 programmable memory buttons12 GB media storage$2,100[4BY] Bamboo Anthracite Wood Trim $0Original Shipping Charge $875RETAIL PRICE (ORIGINALLY NEW) $62,240.00You can Fill out a Free Super Quick Pre-Approval Credit Application here!https://www.smartchevrolet.com/finance/apply-for-financing/We Use Over 40 banks a Credit Unions For the Lowest Rates Possible For All Types Of Credit!FINANCING FOR ALL TYPES OF CREDIT!BAD CREDIT, NO CREDIT, Repossession NO PROBLEM!SLOW PAYMENTS, BANKRUPTCY, REPOS NO PROBLEM!COLLECTIONS, JUDGEMENTS, DIVORCE NO PROBLEM!MONTHLY PAYMENTS TO FIT ANY INCOME!335, 325, 328, 335, 330, 550, 528, 535, 525, 5 Series, 525i, 528i, 528e, 530i, 535i, 540i, 540, 545, 545i, 550i, 650, 750, 760, M3, M5, M6, X1, X3, X6, X5, Z4, X5 M, X6 M, 550 Gran Turismo, 535 Gran Turismo, ActiveHybrid X6, 740, Alpina B7, ActiveHybrid 750, 1 Series M, 640, ActiveHybrid 5, 320, 640 Gran Coupe, X1, 650 Gran Coupe, ActiveHybrid 3, ActiveHybrid 740, 228, 428, M6 Gran Coupe, 328 Gran Turismo, 335 Gran Turismo, 435, 535d, 328d, i3, i8, ActiveHybrid 7, M235, M4, X4, 435 Gran Coupe, 428 Coupe, 325 , 325i , 330 , 330i , 328 , 328i , 335 , 335i, 525 , 745i , 745li , 530i , 545 , 550 , 645 , 650, X5 , 750li , 750 , 750i , X3, X1, X6, 128i, 135i 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 BMW, 0 Down Payment, Suv, 4x4, Buy Here Pay Here, On Lot Financing, Owner Finance, 500 down Bad Credit, Bluetooth, Cheap for cheap, Cars owner Financing, Cheap down Payment, no money down, cars low down payment, car lots No Money Down Bad Credit, used cars for sale, Used SUV, bad credit, $1000 or less used cars, For Sale By Owner, Will Trade for Motorcycles, For Sale Near Me, for sale craigslist, low mileage, low miles, very low miles, 500 down no credit check, $500, Low Down Payment, No CreditA27FBAFAEA464DBBB650D168074EE06C 28003645 8284589BMW 335 335i | NA | nc | NA | NA |
3 | 7088744126 | https://greensboro.craigslist.org/cto/d/greensboro-2011-jaguar-xf-premier/7088744126.html | greensboro | https://greensboro.craigslist.org | 9500 | 2011 | jaguar | xf | excellent | NA | ⋯ | NA | NA | NA | blue | https://images.craigslist.org/00505_f22HGItCRpc_600x450.jpg | 2011 jaguar XF premium - estate sale. Retired lady executive. Like new, garaged and maintained. Very nice leather, heated seats, electric sunroof, metallic blue paint. 85K miles bumper-to-bumper warranty. Premium radio sound system. Built-in phone connection. Please call show contact info cell or show contact info . Asking Price $9500 | NA | nc | 36.1032 | -79.8794 |
print(dim(df))
names(df)
[1] 539759 25
summary(df)
id url region region_url Min. :7065764796 Length:539759 Length:539759 Length:539759 1st Qu.:7084240306 Class :character Class :character Class :character Median :7088972959 Mode :character Mode :character Mode :character Mean :7087608554 3rd Qu.:7092094043 Max. :7093536715 price year manufacturer model Min. : 0 Min. : 0 Length:539759 Length:539759 1st Qu.: 4400 1st Qu.:2007 Class :character Class :character Median : 9500 Median :2012 Mode :character Mode :character Mean : 195421 Mean :2010 3rd Qu.: 17926 3rd Qu.:2015 Max. :4294967295 Max. :2021 NA's :987 condition cylinders fuel odometer Length:539759 Length:539759 Length:539759 Min. : 0 Class :character Class :character Class :character 1st Qu.: 47038 Mode :character Mode :character Mode :character Median : 93300 Mean : 100801 3rd Qu.: 137755 Max. :64809218 NA's :98976 title_status transmission vin drive Length:539759 Length:539759 Length:539759 Length:539759 Class :character Class :character Class :character Class :character Mode :character Mode :character Mode :character Mode :character size type paint_color image_url Length:539759 Length:539759 Length:539759 Length:539759 Class :character Class :character Class :character Class :character Mode :character Mode :character Mode :character Mode :character description county state lat Length:539759 Mode:logical Length:539759 Min. :-84.91 Class :character NA's:539759 Class :character 1st Qu.: 34.20 Mode :character Mode :character Median : 39.08 Mean : 38.43 3rd Qu.: 42.45 Max. : 84.52 NA's :8974 long Min. :-176.81 1st Qu.:-106.78 Median : -88.10 Mean : -94.04 3rd Qu.: -81.22 Max. : 161.02 NA's :8974
The summary of dataset gives a holistic view of all the columns. It shows which columns are numerical and which are in text format.
sapply(df,function(x) sum(is.na(x)))
There are a lot of null values for certain columns. We can use a thumb rule that if any column contains atleast 35% - 40% null values, we can remove those columns from consideration in our case. Such columns do not add much to our analysis and are generally irrelevant to the goal.
remove_cols = function(data) {
threshold = dim(data)[1] * 0.4
cols = names(data)
irrelevant_cols = c()
for (col in cols) {
n_nulls = sum(is.na(data[col]))
if (n_nulls >= threshold) {
irrelevant_cols = c(irrelevant_cols,col)
}
}
return (irrelevant_cols)
}
remove = remove_cols(df)
remove
These columns contain about 40% Null values. It is better to remove these columns as mentioned above. Imputations to these columns are difficult due to lack of features from which these can be dervied from and not enough data to derive from its own column.
df = df[,!names(df) %in% remove]
head(df,3)
id | url | region | region_url | price | year | manufacturer | model | fuel | odometer | title_status | transmission | drive | type | paint_color | image_url | description | state | lat | long | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | |
1 | 7088746062 | https://greensboro.craigslist.org/ctd/d/cary-2012-acura-tl-base-4dr-sedan/7088746062.html | greensboro | https://greensboro.craigslist.org | 10299 | 2012 | acura | tl | gas | 90186 | clean | automatic | NA | other | blue | https://images.craigslist.org/01414_3LIXs9EO33z_600x450.jpg | 2012 Acura TL Base 4dr Sedan Offered by: Best Import Auto Sales Inc — (919) 800-0650 — $10,299 PRISTINE CONDITION INSIDE AND OUT Best Import Auto Sales Inc Year: 2012 Make: Acura Model: TL Series: Base 4dr Sedan VIN: 19UUA8F22CA003926 Condition: Used Mileage: 90,186 Exterior: Blue Interior: Black Body: Sedan Transmission: Automatic 6-Speed Engine: 3.5L V6 **** Best Import Auto Sales Inc. 🚘 Raleigh Auto Dealer ***** ⚡️⚡️⚡️ Call Or Text (919) 800-0650 ⚡️⚡️⚡️ ✅ - We can arrange Financing Options with most banks and credit unions!!!! ✅ Extended Warranties Available on most vehicles!! "Call To Inquire" ✅ Full Service ASE-Certified Shop Onsite! More vehicle details: best-import-auto-sales-inc.hammerwebsites.net/v/cfoamRwq Address: 1501 Buck Jones Rd Raleigh, NC 27606 Phone: (919) 800-0650 Website: www.bestimportsonline.com 📲 ☎️ Call or text (919) 800-0650 for quick answers to your questions about this Acura TL Your message will always be answered by a real human — never an automated system. Disclaimer: Best Import Auto Sales Inc will never sell, share, or spam your mobile number. Standard text messaging rates may apply. 2012 Acura TL Base 4dr Sedan 30b9c4702111452eb57503c99e795660 | nc | 35.7636 | -78.7443 |
2 | 7088745301 | https://greensboro.craigslist.org/ctd/d/bmw-3-series-335-convertible-harmon/7088745301.html | greensboro | https://greensboro.craigslist.org | 0 | 2011 | bmw | 335 | gas | 115120 | clean | automatic | rwd | convertible | blue | https://images.craigslist.org/00S0S_1kTatLGLxB5_600x450.jpg | BMW 3 Series 335i Convertible Navigation Dakota Leather Heated Seats Automatic High Beam We Finance!Price: Call for PricingCall Today 888-728-7443You can Fill out a Free Super Quick Pre-Approval Credit Application here!For Instant Online Approvals! OPEN TODAY! 888-768-8164Copy The Link Belowhttps://www.smartchevrolet.com/finance/apply-for-financing/Finance Online, We Can Ship To Your Door!We Have First Time Buyers Program!We Use Over 40 banks a Credit Unions With Lowest Rates Possible For All Types Of Credit!FINANCING FOR ALL TYPES OF CREDIT! BAD CREDIT, NO CREDIT, Repossession NO PROBLEM!SLOW PAYMENTS, BANKRUPTCY, REPOS NO PROBLEM!COLLECTIONS, JUDGEMENTS, DIVORCE NO PROBLEM!TRADE-INS Great! WE BUY CARS Everyday, Even if you don't buy Ours!Se Habla EspanolAsk for: Craigslist Salesstore: (888) 728-7443VIN Number: WBADX7C51BE579063Engine: 3.0L 6-Cylinder DOHC Twin TurbochargedStock No: 579063KBWarranty: Original Manufacturer WarrantyMiles: 115120Interior: Oyster BlackTrans: 6-Speed Automatic SteptronicExterior: Blue Water MetallicOptions:Navigation SystemReal Time Traffic InformationCold Weather PackageConvenience PackagePremium PackageConvertible Hardtop8 SpeakersAM/FM CD/MP3 RadioAM/FM RadioCD PlayerIpod & USB AdapterMP3 DecoderRadio Data SystemSIRIUS Satellite RadioAir ConditioningAutomatic Temperature ControlFront Dual Zone A/CRear Air ConditioningRear Window DefrosterComfort Access Keyless EntryLumbar SupportMemory SeatPower Driver SeatPower SteeringPower WindowsRemote Keyless EntrySteering Wheel Mounted A/C ControlsSteering Wheel Mounted Audio ControlsVoice CommandActive SteeringFour Wheel Independent SuspensionSpeed-Sensing SteeringSport SuspensionTraction Control4-Wheel Disc BrakesABS BrakesAnti-Whiplash Front Head RestraintsDual Front Impact AirbagsDual Front Side Impact AirbagsFront Anti-Roll BarIntegrated Roll-Over ProtectionKnee AirbagLow Tire Pressure WarningOccupant Sensing AirbagRear Anti-Roll BarBMW Assist W/Bluetooth®Smartphone IntegrationBrake AssistElectronic Stability ControlAutomatic High BeamsDelay-Off HeadlightsFront Fog LightsFully Automatic HeadlightsHigh Intensity Discharge Headlights: Bi-XenonRetractable Headlight WashersAnti-Theft Alarm SystemActive Cruise ControlSpeed ControlAuto-Dimming MirrorsBumpers: Body-ColorHeated Door MirrorsPower Door MirrorsAuto-Dimming Rear-View MirrorBamboo Anthracite Wood TrimConvertible Roof LiningDigital Compass MirrorDriver Door BinDriver Vanity MirrorFront Reading LightsGenuine Wood Console InsertGenuine Wood Dashboard InsertGenuine Wood Door Panel InsertHeated Steering WheelIlluminated EntryLeather Shift KnobOutside Temperature DisplayPark Distance ControlPassenger Vanity MirrorPower Convertible RoofRear Reading LightsRear Seat Center ArmrestSmartphone Integration (DISC)Sport Steering WheelTachometerTelescoping Steering WheelTilt Steering WheelTrip ComputerUniversal Garage-Door OpenerDakota Leather UpholsteryFront Bucket SeatsFront Center ArmrestHeated Front SeatsPower Passenger SeatThrough-Loading System W/Integrated Transport BagPassenger Door Bin17" Light Alloy Star-Spoke (Style 339) WheelsGlass Rear WindowRain Sensing WipersVariably Intermittent WipersCARFAX CERTIFIEDDescription:BMW 3 Series 335i Convertible, Navigation System, Dakota Leather, Heated Seats, Automatic High Beams, Carfax Cerified, Premium Package, Steptronic Auto Trans. We Finance! $11040 In Installed Options on the car!INSTALLED OPTIONS[205] Steptronic Automatic Trans.normalsport & manual shift modes$1,375[896] Blue Water Metallic $550[LCCX] Oyster/Black Dakota Leather $0[ZPP] Premium PackageUniversal garage-door openerAuto-dimming mirrorsAuto-dimming rearview mirrorLumbar supportInterior mirror with compassBMW Assist with Bluetooth$1,650[ZCV] Convenience PackageAlarm SystemComfort Access keyless entryPark Distance Control$1,250[217] Active Steering $1,550[5AC] Automatic High Beams $250[655] Sirius XM Radio W/ 1 Year Sub.(1) year subscription$350[6FL] I Pod And Usb Adapter $400[494] Heated Front Seats $500[248] Heated Steering Wheel $190[609] Navigation System16:9 high-resolution display3-D screenvoice command systemreal time traffic infoiDrive system w/on-board computer6 programmable memory buttons12 GB media storage$2,100[4BY] Bamboo Anthracite Wood Trim $0Original Shipping Charge $875RETAIL PRICE (ORIGINALLY NEW) $62,240.00You can Fill out a Free Super Quick Pre-Approval Credit Application here!https://www.smartchevrolet.com/finance/apply-for-financing/We Use Over 40 banks a Credit Unions For the Lowest Rates Possible For All Types Of Credit!FINANCING FOR ALL TYPES OF CREDIT!BAD CREDIT, NO CREDIT, Repossession NO PROBLEM!SLOW PAYMENTS, BANKRUPTCY, REPOS NO PROBLEM!COLLECTIONS, JUDGEMENTS, DIVORCE NO PROBLEM!MONTHLY PAYMENTS TO FIT ANY INCOME!335, 325, 328, 335, 330, 550, 528, 535, 525, 5 Series, 525i, 528i, 528e, 530i, 535i, 540i, 540, 545, 545i, 550i, 650, 750, 760, M3, M5, M6, X1, X3, X6, X5, Z4, X5 M, X6 M, 550 Gran Turismo, 535 Gran Turismo, ActiveHybrid X6, 740, Alpina B7, ActiveHybrid 750, 1 Series M, 640, ActiveHybrid 5, 320, 640 Gran Coupe, X1, 650 Gran Coupe, ActiveHybrid 3, ActiveHybrid 740, 228, 428, M6 Gran Coupe, 328 Gran Turismo, 335 Gran Turismo, 435, 535d, 328d, i3, i8, ActiveHybrid 7, M235, M4, X4, 435 Gran Coupe, 428 Coupe, 325 , 325i , 330 , 330i , 328 , 328i , 335 , 335i, 525 , 745i , 745li , 530i , 545 , 550 , 645 , 650, X5 , 750li , 750 , 750i , X3, X1, X6, 128i, 135i 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 BMW, 0 Down Payment, Suv, 4x4, Buy Here Pay Here, On Lot Financing, Owner Finance, 500 down Bad Credit, Bluetooth, Cheap for cheap, Cars owner Financing, Cheap down Payment, no money down, cars low down payment, car lots No Money Down Bad Credit, used cars for sale, Used SUV, bad credit, $1000 or less used cars, For Sale By Owner, Will Trade for Motorcycles, For Sale Near Me, for sale craigslist, low mileage, low miles, very low miles, 500 down no credit check, $500, Low Down Payment, No CreditA27FBAFAEA464DBBB650D168074EE06C 28003645 8284589BMW 335 335i | nc | NA | NA |
3 | 7088744126 | https://greensboro.craigslist.org/cto/d/greensboro-2011-jaguar-xf-premier/7088744126.html | greensboro | https://greensboro.craigslist.org | 9500 | 2011 | jaguar | xf | gas | 85000 | clean | automatic | NA | NA | blue | https://images.craigslist.org/00505_f22HGItCRpc_600x450.jpg | 2011 jaguar XF premium - estate sale. Retired lady executive. Like new, garaged and maintained. Very nice leather, heated seats, electric sunroof, metallic blue paint. 85K miles bumper-to-bumper warranty. Premium radio sound system. Built-in phone connection. Please call show contact info cell or show contact info . Asking Price $9500 | nc | 36.1032 | -79.8794 |
Let us look at the unique values for each column, especially those that are measured on nominal or ordinal scale. This can give us a sense of which columns can be important to us. Some variables even on a nominal scale have too many unique values, such variables would not add anything to the final model (like url).
sapply(df,function(x) length(unique(x)))
From above, we see that there are columns having too many unique values and not a variable on interval/ratio scale, thus we can remove some columns we know will not be any help to us such as :-
The reason why we removed paint_color is because intuitively the color of the vehicle does not add anything to its price, mathematically it might be correlated for the sample we are analyzing on, but in common sense we know that is not true. Including it can create bias in the data.
irrelevant_cols = c(
'id',
'url',
'region',
'region_url',
'image_url',
'description',
'model',
'state',
'paint_color'
)
df = df[,!names(df) %in% irrelevant_cols]
head(df,5)
price | year | manufacturer | fuel | odometer | title_status | transmission | drive | type | lat | long | |
---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | |
1 | 10299 | 2012 | acura | gas | 90186 | clean | automatic | NA | other | 35.7636 | -78.7443 |
2 | 0 | 2011 | bmw | gas | 115120 | clean | automatic | rwd | convertible | NA | NA |
3 | 9500 | 2011 | jaguar | gas | 85000 | clean | automatic | NA | NA | 36.1032 | -79.8794 |
4 | 3995 | 2004 | honda | gas | 212526 | clean | automatic | fwd | SUV | 35.7636 | -78.7443 |
5 | 41988 | 2016 | chevrolet | gas | NA | clean | automatic | NA | NA | 41.1345 | -96.2458 |
The dataset has been reduced to the above. The next step is to analyze the columns individually for feature engineering and selection. Before heading on we will remove all duplicate rows from the dataset.
df = distinct(df)
df = filter(df,rowSums(is.na(df)) != ncol(df))
The price column is the target column. Let us look at the distribution of the price
summary(df$price)
boxplot(df$price, ylab = 'Price Distribution')
Min. 1st Qu. Median Mean 3rd Qu. Max. 0 4300 8995 239825 16900 4294967295
We can see some absurd values in the price
column. These values usually skew the data and pull the mean much higher than it actually would be. We will have to remove the outliers. Since this is a used cars dataset, the prices being so high are unlikely.
Let us analyze these outliers on a granular level.
df[df$price > 500000,]
price | year | manufacturer | fuel | odometer | title_status | transmission | drive | type | lat | long | |
---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | |
27 | 14980499 | 2013 | ford | gas | 80683 | clean | automatic | rwd | NA | 39.7005 | -104.8650 |
1200 | 1316134911 | 2014 | gmc | gas | NA | clean | automatic | 4wd | SUV | 43.0765 | -76.1622 |
20231 | 1234567890 | 2005 | chevrolet | gas | 242099 | salvage | automatic | 4wd | pickup | 36.7159 | -83.7668 |
28098 | 24500499 | 2015 | toyota | gas | 103054 | clean | automatic | 4wd | pickup | 39.7005 | -104.8650 |
28954 | 1111111 | 2012 | toyota | gas | NA | clean | automatic | NA | NA | 25.9504 | -97.3217 |
33088 | 113297506 | 2017 | NA | gas | NA | clean | automatic | 4wd | truck | 36.4138 | -90.5885 |
35808 | 10000000 | 2003 | honda | gas | NA | missing | automatic | NA | NA | 32.2239 | -106.2630 |
36350 | 2820142816 | 1967 | NA | gas | NA | clean | automatic | NA | NA | 32.9374 | -96.7872 |
55358 | 972216506 | 2007 | dodge | gas | NA | salvage | automatic | rwd | pickup | 26.1771 | -98.1870 |
59262 | 1215752191 | 1961 | chevrolet | other | NA | clean | other | NA | NA | 42.0212 | -71.2178 |
61757 | 1215752191 | 1947 | ford | other | NA | clean | other | NA | NA | 42.0212 | -71.2178 |
62929 | 1215752191 | 1957 | chevrolet | other | NA | clean | other | NA | NA | 42.0212 | -71.2178 |
71813 | 900000 | 2018 | chevrolet | diesel | NA | clean | automatic | NA | NA | 39.7224 | -121.8110 |
72302 | 123456789 | 1955 | chevrolet | gas | NA | clean | manual | NA | NA | 33.6339 | -86.5981 |
75276 | 2521176519 | 2001 | ford | gas | NA | clean | automatic | 4wd | SUV | 34.9235 | -87.7610 |
92813 | 1200000 | 2007 | NA | gas | 30400 | clean | automatic | rwd | convertible | 34.4228 | -114.1250 |
96515 | 595000 | 2020 | NA | gas | NA | clean | automatic | NA | other | 33.4484 | -112.0740 |
100874 | 678910 | 1955 | NA | gas | NA | clean | manual | rwd | convertible | 32.2577 | -110.7990 |
111644 | 2500000 | 1947 | NA | gas | 0 | clean | manual | rwd | wagon | 34.8286 | -120.4270 |
113376 | 1111222223 | 2007 | cadillac | gas | NA | salvage | automatic | NA | NA | 26.2159 | -98.3253 |
118111 | 3755744318 | 2005 | ford | gas | 154000 | clean | automatic | 4wd | pickup | 28.5697 | -80.8191 |
121300 | 11111111 | 2008 | ford | gas | NA | clean | automatic | fwd | truck | 29.4666 | -98.5673 |
127568 | 600000 | 2006 | mercedes-benz | gas | 126500 | clean | automatic | rwd | sedan | 32.4142 | -110.9470 |
128373 | 1116080943 | 2012 | honda | gas | 98090 | clean | automatic | fwd | coupe | 29.6732 | -82.3226 |
128374 | 819412050 | 2015 | nissan | gas | 87397 | clean | automatic | fwd | sedan | 29.6732 | -82.3226 |
128375 | 2959541855 | 2010 | mazda | gas | 116804 | clean | manual | fwd | sedan | 29.6732 | -82.3226 |
128376 | 2868355711 | 2010 | toyota | gas | 126499 | clean | automatic | fwd | hatchback | 29.6732 | -82.3226 |
128524 | 629000 | 2008 | gmc | gas | 99251 | clean | automatic | rwd | SUV | 30.4549 | -84.3455 |
130907 | 123456789 | 2010 | cadillac | gas | 132321 | clean | automatic | rwd | NA | 27.4072 | -82.5303 |
132011 | 123456789 | 1995 | ford | gas | 148284 | clean | automatic | rwd | NA | 27.4072 | -82.5303 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
264446 | 999500 | 2013 | toyota | gas | 129000 | clean | automatic | fwd | NA | 33.9603 | -118.2860 |
272776 | 123456789 | 1950 | chevrolet | gas | NA | clean | automatic | NA | NA | 33.6339 | -86.5981 |
275628 | 123456789 | 2005 | jeep | gas | NA | clean | other | NA | NA | 38.0258 | -87.5965 |
279599 | 2521176519 | 1972 | dodge | gas | 53000 | clean | automatic | rwd | sedan | 43.1522 | -77.5057 |
288853 | 35006500 | 2012 | chevrolet | gas | NA | clean | automatic | fwd | NA | 39.7982 | -84.3211 |
290505 | 88888888 | 2008 | nissan | gas | NA | clean | automatic | 4wd | NA | 43.9519 | -75.8704 |
294472 | 25006500 | 2012 | chevrolet | gas | NA | clean | automatic | fwd | NA | 39.7982 | -84.3211 |
302823 | 11111111 | 1965 | ford | gas | NA | clean | automatic | NA | NA | 31.1925 | -97.0581 |
304242 | 39000000 | 1932 | ford | gas | NA | clean | automatic | rwd | other | 35.4141 | -80.6162 |
305590 | 1500000 | 2004 | infiniti | gas | 190000 | clean | automatic | 4wd | SUV | 35.2583 | -80.7948 |
306062 | 3795255043 | 2005 | lexus | gas | 133669 | clean | automatic | 4wd | SUV | 37.1763 | -113.2950 |
306064 | 3167547323 | 2014 | toyota | gas | 147107 | clean | automatic | fwd | sedan | 37.1763 | -113.2950 |
311018 | 1111111 | 1983 | mercedes-benz | diesel | 125000 | clean | manual | 4wd | convertible | 37.7756 | -78.5569 |
317497 | 22222222 | 1927 | ford | gas | NA | clean | automatic | rwd | other | 29.7325 | -82.0279 |
319904 | 17000000 | 2004 | ford | gas | NA | lien | automatic | rwd | pickup | 26.1489 | -98.2710 |
320618 | 2521065408 | 2007 | cadillac | gas | NA | salvage | automatic | NA | NA | 26.2937 | -98.3008 |
321083 | 2650000 | 1991 | NA | diesel | NA | clean | other | NA | NA | 35.9591 | -95.9997 |
325669 | 2220913095 | 1984 | NA | diesel | NA | clean | automatic | NA | NA | 40.6798 | -89.7952 |
331404 | 123456789 | 1994 | NA | gas | 142374 | clean | automatic | rwd | NA | 27.4072 | -82.5303 |
332536 | 3736928711 | 2013 | ford | gas | NA | clean | automatic | NA | NA | 34.1997 | -119.1810 |
332541 | 3736928711 | 2013 | honda | gas | NA | clean | automatic | NA | NA | 34.1997 | -119.1810 |
337745 | 123456789 | 1983 | ford | gas | NA | clean | automatic | NA | NA | 34.4121 | -119.8650 |
338669 | 111111111 | 1984 | NA | diesel | NA | clean | automatic | NA | NA | 40.6798 | -89.7952 |
340376 | 999999 | 2002 | dodge | gas | 154000 | clean | automatic | fwd | mini-van | 39.1341 | -89.8442 |
344139 | 1410065407 | 2014 | chevrolet | gas | NA | clean | automatic | 4wd | NA | 43.0765 | -76.1622 |
344450 | 1234567 | 1996 | mazda | gas | NA | clean | manual | rwd | NA | 46.5786 | -87.4545 |
352091 | 5599500 | 2019 | gmc | diesel | 55 | clean | automatic | 4wd | pickup | 38.5015 | -90.4683 |
354366 | 734294471 | 1984 | NA | diesel | NA | clean | automatic | NA | NA | 40.6798 | -89.7952 |
357775 | 28003500 | 2007 | chevrolet | gas | 2000 | clean | automatic | fwd | NA | 39.7982 | -84.3211 |
360089 | 1750000 | 2014 | ford | gas | NA | clean | automatic | 4wd | truck | 41.5792 | -75.2528 |
As a first step, let us remove data where the prices are more than 500000. Prices above this seem too high and could be erroneous. We will look at the distribution of the data again using a boxplot. Since the absurd very high prices are removed, we should now be able to see a better distribution.
df = filter(df,df$price <= 500000)
head(df,5)
price | year | manufacturer | fuel | odometer | title_status | transmission | drive | type | lat | long | |
---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | |
1 | 10299 | 2012 | acura | gas | 90186 | clean | automatic | NA | other | 35.7636 | -78.7443 |
2 | 0 | 2011 | bmw | gas | 115120 | clean | automatic | rwd | convertible | NA | NA |
3 | 9500 | 2011 | jaguar | gas | 85000 | clean | automatic | NA | NA | 36.1032 | -79.8794 |
4 | 3995 | 2004 | honda | gas | 212526 | clean | automatic | fwd | SUV | 35.7636 | -78.7443 |
5 | 41988 | 2016 | chevrolet | gas | NA | clean | automatic | NA | NA | 41.1345 | -96.2458 |
summary(df$price)
boxplot(df$price, ylab = 'Price Distribution')
Min. 1st Qu. Median Mean 3rd Qu. Max. 0 4300 8995 12059 16900 479000
The box plot shows there are still many outliers.
To further rectify this, we will set up a threshold using the interquartile range
(IQR). Instead of deciding on a threshold myself, this method can easily give me a bound that would be reasonable with respect to the data in general.
interquartile range = 75% - 25%
The dataset used is very large and is bound to have many outliers. The interquartile range
(IQR) method automatically fits the distribution to normal form and remove the outliers. It does remove the data points very generously and we could be losing valuable information.
stats = summary(df$price)
interquartile = stats[5] - stats[2]
thresh = interquartile * 1.5
upper = stats[5] + thresh
lower = stats[2] - thresh
df = filter(df,between(df$price,lower,upper))
boxplot(df$price, ylab='Price Distribution')
Apart from the price, the derived dataset only contains two numerical columns, the odometer and the year. We will analyze these values to identify how they influence the price.
summary(df[c('odometer','year')])
odometer year Min. : 0 Min. : 0 1st Qu.: 50219 1st Qu.:2006 Median : 96697 Median :2011 Mean : 104519 Mean :2009 3rd Qu.: 141153 3rd Qu.:2015 Max. :64809218 Max. :2021 NA's :73181 NA's :399
The year column seems unrealistic at 2021. Let us plot a countplot for the year to see the trend in buying/selling used cars on craigslist. We can tell that sales actually began increasing after 1990. Hence we will consider data after this point in time as years below that wud have less representative data.
tbl = table(df$year)
barplot(tbl,
xlab = 'Year',
ylab = 'Number of cars sold',
main = "Cars sold per year of Craigslist",
col = '#ff9933'
)
Our assumption was right, only after 1990, there are significant data points for car being sold. So we will limit out year column to 1990 and 2020.
# df[between(df$year,1960,2020),]
df = filter(df,between(df$year,1985,2020))
The important thing to analyze is whether there exists a relationship between year and price.
plot(
df$year,
df$price,
xlab = 'Year',
ylab = 'Price',
main = 'Price vs Year of car sold',
col = '#009999'
)
abline(lm(price ~ year, data = df), col='#ff9933', lwd=4)
The plot shows a positive correlation between the year and price. Even though these exists some Linear correlation, prices from every year have a wide range and the increase in prices over the years is not continuous. Another issue is that the number of cars sold over the years varies drastically, this can cause the mean to shift. As shown in the plot below.
grouped = df %>% group_by(year) %>% summarise(avg_price = mean(price))
barplot(
names.arg = grouped$year,
height = grouped$avg_price,
xlab = 'Year',
ylab = 'Average price range',
main = 'Average price of cars sold over the years',
col = '#ff9933'
)
Intuituvely the affect of year on the price is limited to changes in markets and economies leading to price shifts in the entire industry. The plot shows this shift, where vehicles in years between 1985 and 2005 seemed to have lesser prices in general, then the industry shot up from 2006 onwards. As assumed before, the plot confirms it.
The year column can be considered for the model given its positive correlation.
The odometer column describes the distance the vehicle has already travelled. This is an important factor for a buyer. From the previous summary()
function run, odometer also has some very absurd values. These values are too high as most cars usually have a limit to how many miles it can travel before an engine replacement is required.
boxplot(df$odometer, ylab = 'Odometer Distribution')
Analyzing more granular information, i.e. odometer values that are above 150000.
The threshold is chosen arbitarily.
filter(df,df$odometer > 1500000)
price | year | manufacturer | fuel | odometer | title_status | transmission | drive | type | lat | long |
---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> |
2700 | 2003 | honda | gas | 2170000 | clean | automatic | fwd | mini-van | 36.1253 | -87.2190 |
2000 | 2003 | subaru | gas | 2000000 | clean | automatic | 4wd | NA | 44.5448 | -123.2660 |
2200 | 2004 | toyota | hybrid | 2999999 | clean | automatic | fwd | sedan | 42.3010 | -89.0833 |
900 | 2004 | nissan | gas | 2110000 | clean | automatic | 4wd | SUV | 41.2647 | -95.9639 |
7500 | 2000 | ford | gas | 8343040 | clean | automatic | fwd | convertible | 47.9652 | -120.2230 |
4000 | 2008 | ford | diesel | 3500000 | clean | automatic | rwd | truck | 41.7283 | -70.4400 |
8895 | 2004 | ford | gas | 2079503 | clean | automatic | 4wd | truck | 46.6165 | -120.7090 |
3200 | 2005 | ford | diesel | 2121575 | clean | automatic | 4wd | pickup | 43.8515 | -88.8337 |
4450 | 2000 | toyota | gas | 2727000 | clean | automatic | rwd | truck | 30.3314 | -97.8666 |
6500 | 2004 | ford | gas | 1623262 | clean | automatic | fwd | pickup | 39.8498 | -104.6970 |
2500 | 2001 | chevrolet | gas | 2000000 | clean | automatic | 4wd | truck | 42.2580 | -76.1501 |
5000 | 2019 | chevrolet | gas | 4728224 | clean | automatic | NA | NA | 31.7814 | -106.3610 |
6000 | 2002 | ford | diesel | 3000000 | clean | automatic | 4wd | NA | 45.5781 | -111.1260 |
29800 | 2006 | ford | diesel | 1698750 | clean | automatic | 4wd | van | 45.5781 | -111.1260 |
6000 | 2007 | lincoln | gas | 1646500 | clean | automatic | rwd | SUV | 31.9010 | -106.5810 |
7500 | 1999 | ram | diesel | 1901724 | clean | automatic | 4wd | pickup | 40.6842 | -73.4171 |
32500 | 2011 | NA | diesel | 3250000 | clean | automatic | 4wd | pickup | 32.5768 | -85.3120 |
5700 | 1999 | chevrolet | gas | 1900000 | clean | automatic | 4wd | SUV | 33.3150 | -86.6680 |
6875 | 2012 | toyota | gas | 1570000 | clean | automatic | fwd | sedan | 39.0163 | -94.6950 |
3500 | 2006 | toyota | hybrid | 1987700 | clean | automatic | NA | SUV | 40.0255 | -105.2780 |
33000 | 2001 | NA | diesel | 1581983 | clean | manual | NA | NA | 40.8349 | -83.2513 |
24500 | 2015 | ford | gas | 2590000 | clean | automatic | 4wd | SUV | 39.2305 | -94.4338 |
1200 | 2003 | ford | gas | 2740000 | salvage | automatic | NA | mini-van | 38.8138 | -77.6220 |
18700 | 2006 | ram | diesel | 2051670 | clean | automatic | 4wd | truck | 37.3484 | -78.9978 |
500 | 1991 | chevrolet | gas | 2400000 | clean | automatic | 4wd | truck | 40.3431 | -81.5189 |
2000 | 2005 | honda | gas | 1940000 | clean | automatic | fwd | coupe | 41.3442 | -89.0955 |
5495 | 2008 | chevrolet | gas | 1600000 | clean | automatic | 4wd | truck | 39.7019 | -75.5708 |
650 | 2000 | NA | gas | 1666802 | clean | automatic | 4wd | SUV | 61.2004 | -149.6890 |
19000 | 2012 | NA | diesel | 7788852 | clean | manual | NA | NA | 32.2787 | -101.4580 |
7000 | 2003 | chevrolet | diesel | 2490000 | clean | automatic | 4wd | truck | 37.6690 | -97.2802 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
3500 | 2004 | toyota | gas | 2771110 | clean | automatic | NA | NA | 38.9087 | -90.1568 |
10500 | 1990 | NA | diesel | 9999999 | clean | manual | rwd | truck | 40.3582 | -79.8316 |
0 | 2020 | NA | gas | 9220158 | clean | automatic | NA | NA | 36.1612 | -80.3240 |
3200 | 2004 | toyota | gas | 2100000 | clean | automatic | NA | truck | 28.1780 | -80.6020 |
3899 | 2004 | chevrolet | gas | 1940000 | clean | automatic | 4wd | SUV | 41.4093 | -75.6417 |
2995 | 2001 | ford | gas | 9999999 | clean | automatic | rwd | pickup | 38.9861 | -95.6878 |
2700 | 2001 | lexus | gas | 2980000 | salvage | automatic | rwd | sedan | 30.4507 | -91.0908 |
4000 | 2011 | mazda | gas | 1700400 | clean | automatic | fwd | sedan | 34.2540 | -118.7880 |
3950 | 2001 | lexus | gas | 1970000 | clean | automatic | fwd | sedan | 34.0591 | -80.9640 |
2800 | 2004 | acura | gas | 1710000 | rebuilt | automatic | 4wd | SUV | 36.1730 | -115.0670 |
2000 | 2005 | chevrolet | gas | 2000000 | clean | automatic | NA | NA | 43.2952 | -86.2790 |
6995 | 2006 | lexus | gas | 2182111 | clean | automatic | NA | NA | 29.9871 | -90.1695 |
3900 | 2009 | gmc | gas | 1950000 | clean | automatic | NA | NA | 33.7408 | -84.2632 |
3500 | 1998 | nissan | diesel | 2007000 | clean | other | rwd | truck | 37.9691 | -122.5100 |
5500 | 2001 | ford | diesel | 2580000 | clean | automatic | 4wd | pickup | 40.6152 | -74.4150 |
7900 | 2008 | audi | gas | 64809218 | NA | automatic | 4wd | convertible | 39.6963 | -104.8950 |
15500 | 2012 | infiniti | gas | 3840000 | clean | automatic | rwd | coupe | 39.1552 | -123.1950 |
1500 | 1999 | dodge | gas | 1999999 | clean | automatic | rwd | NA | 39.9771 | -74.1565 |
32000 | 2015 | gmc | gas | 6500000 | clean | automatic | fwd | pickup | 48.9028 | -114.9290 |
2500 | 2004 | pontiac | gas | 1990000 | salvage | automatic | NA | NA | 48.8522 | -95.5356 |
7500 | 2009 | honda | gas | 1553535 | clean | automatic | rwd | SUV | 41.6008 | -93.5791 |
12000 | 2004 | ford | diesel | 2400000 | clean | automatic | 4wd | pickup | 40.4070 | -74.0094 |
3900 | 2007 | ford | gas | 2220000 | clean | automatic | NA | SUV | 43.2692 | -93.6356 |
9985 | 2009 | ram | gas | 1850000 | clean | automatic | rwd | truck | 41.3719 | -96.1527 |
1550 | 2000 | jeep | gas | 2098818 | salvage | automatic | 4wd | NA | 35.2023 | -88.2005 |
5000 | 2006 | infiniti | gas | 2130000 | clean | automatic | 4wd | SUV | 46.8176 | -96.0670 |
8500 | 2012 | dodge | gas | 7691300 | clean | automatic | fwd | SUV | 40.9116 | -74.2605 |
3500 | 2008 | nissan | gas | 1650000 | clean | automatic | fwd | sedan | 40.8219 | -74.4200 |
32000 | 2017 | chevrolet | gas | 3480400 | clean | automatic | 4wd | truck | 40.6321 | -89.6903 |
600 | 1998 | jeep | gas | 2380567 | clean | automatic | 4wd | SUV | 37.6691 | -77.3294 |
The odometer values above 1000000 are too high and seem unreasonable. The model should be realistic and not dependent completely on data. Hence we will remove the rows with odometer values greater than 1000000.
df = filter(df, df$odometer <= 1000000)
boxplot(df$odometer, ylab = 'Odometer Distribution')
There still exists many outliers. Thus the same interquartile range
(IQR) method is applied here.
stats = summary(df$odometer)
interquartile = stats[5] - stats[2]
thresh = interquartile * 1.5
upper = stats[5] + thresh
lower = stats[2] - thresh
df = filter(df,between(df$odometer,lower,upper))
boxplot(df$price, ylab='Odometer reading Distribution')
We have achieved a good spread now, and that will be enough for now. Let us analyze the relationship between odometer and the price of the car.
plot(
df$odometer,
df$price,
xlab = 'Odometer reading',
ylab = 'Price',
main = 'Odometer reading vs price',
col = '#009999'
)
abline(lm(price~odometer,data=df),col='#ff9933', lwd=3)
The odometer vs price plot draws the following conclusions:
Let us now look at the nominal or ordinal variables that we have. Lets understand the manufacturer and type columns. The analysis on the data is to show vehicles of which manufacturers and type of vehicles are selling most online.
We find the Top 10 manufacturers listed on craigslist based on two criteria.
grouped = as.data.frame(df %>% group_by(manufacturer) %>% summarise(avg_price = mean(price)))
top_sold = head(grouped[order(- grouped$avg_price),],9)
top_sold
manufacturer | avg_price | |
---|---|---|
<chr> | <dbl> | |
37 | tesla | 22380.25 |
2 | alfa-romeo | 19348.85 |
33 | ram | 16847.51 |
34 | rover | 14522.15 |
3 | aston-martin | 14244.25 |
14 | gmc | 14052.63 |
32 | porche | 13783.22 |
20 | jeep | 13457.74 |
4 | audi | 12985.45 |
top_manufacturers = sort(table(df$manufacturer),decreasing = TRUE)[0:10]
par( mfrow = c(1,2), mar=c(5,6,1,2))
barplot(
top_manufacturers,
horiz = TRUE,
las=TRUE,
col = '#009999',
xlab = 'Number of cars sold',
main = 'Number of cars sold per manufacturer - top 10'
)
barplot(
names.arg = top_sold$manufacturer,
height = top_sold$avg_price,
horiz = TRUE,
las=TRUE,
col = '#009999',
xlab = 'Average price range',
main = 'Average price range of cars sold per manufacturer - top 10'
)
The first plot above shows the manufacturers with the most sold vehicles. These vehicles are most prefered by buyers according to the dataset. The same manufacturers are not ones with the cars priced at higher ranges. The following conclusions are drawn :-
Next let us look at the type column which describes the type of a car. Type of a car can describe what type of cars buyers are generally looking for.
grouped = df %>% group_by(type) %>% summarise(avg_price = mean(price))
top_sold = head(grouped[order(- grouped$avg_price),],9)
top_sold
type | avg_price |
---|---|
<chr> | <dbl> |
pickup | 16617.25 |
other | 14975.13 |
truck | 14617.34 |
offroad | 13199.23 |
coupe | 12594.72 |
van | 11627.16 |
SUV | 11514.62 |
convertible | 11512.44 |
bus | 10564.27 |
top_types = sort(table(df$type),decreasing = TRUE)[0:10]
par(mar=c(5,6,1,2), mfrow = c(1,2))
barplot(
top_types,
horiz = TRUE,
las=TRUE,
col = '#009999',
xlab = 'Number of cars sold',
main = 'Number of cars sold per car type - top 10'
)
barplot(
names.arg = top_sold$type,
height = top_sold$avg_price,
horiz = TRUE,
las = TRUE,
col = '#009999',
xlab = 'Average price range',
main = 'Average price range per car type sold - top 10'
)
The first plot shows the most sold type of cars. The sedans and SUVs are most popular amongst the buyers. But these types are not the ones that are at higher price range. Pickups and trucks are at higher price ranges. This draws the conclusion that :-
The two sets of plots tells us that buyers prefer cars that are at the lower or mediocre price range.
Manufacturer of a car does not directly add to the price prediction. Since no linear relationship can be established. The types column also fails to give a generalization in terms of price. A sedan can have a lower price point or a higher price point depending on alot of factors. Hence both the columns are not going to be considered.
Let us now look at the fuel column. It describes the fuel type the car consumed. Intuitvely vehicles with different fuel capabilites have diverse price ranges.
print(unique(df$fuel))
grouped = df %>% group_by(fuel) %>% summarise(avg_price = mean(price))
fuel_type = grouped[0:5,]
barplot(
names.arg = fuel_type$fuel,
height = fuel_type$avg_price,
col = '#009999',
xlab = 'Fuel type',
ylab = 'Average price range',
main = 'Average price range of cars sold per fuel type'
)
[1] "gas" "diesel" NA "other" "electric" "hybrid"
As assumed, the prices of cars differ between different fuel types. The cars running on disel or electricity are high priced as compared to the rest. Hybrid and gas cars are among the lowest. Fuel type can be a good predictor of the price as it can clearly differentitate between cars and its price range.
The title_status column describes titles associated with the car. A car can be rebuilt, salvaged or clean (almost brand new). A car can even be under lien i.e. under a loan. Intuitively, this column should be a good predictor for the price of a car. Since the title attached to a car can be a deciding factor of its price. A car with the clean status will have a good price as compared to a salvaged or rebuilt car.
print(unique(df$title_status))
grouped = df %>% group_by(title_status) %>% summarise(avg_price = mean(price))
car_status = grouped[1:6,]
car_status
[1] "clean" "salvage" "rebuilt" "lien" NA [6] "missing" "parts only"
title_status | avg_price |
---|---|
<chr> | <dbl> |
clean | 11482.988 |
lien | 16417.128 |
missing | 2698.858 |
parts only | 3742.500 |
rebuilt | 10064.677 |
salvage | 8596.156 |
par(mfrow = c(1,2))
barplot(
table(df$title_status),
xlab = 'Car status',
ylab = 'Number of cars sold',
col = '#ff9933',
main = 'Number of cars sold per status'
)
barplot(
names.arg = car_status$title_status,
height = car_status$avg_price,
col = "#009999",
xlab = 'Car status',
ylab = 'Average price range',
main = 'Average price range of car sold per status'
)
The following conclusions can be drawn from the above plot :-
As hypothesised before, all statuses show differences in price ranges and hence can be estimated to be a good predictor for the price of a car.
print(unique(df$transmission))
grouped = df %>% group_by(transmission) %>% summarise(avg_price = mean(price))
transmission_types = grouped[1:3,]
transmission_types
[1] "automatic" NA "manual" "other"
transmission | avg_price |
---|---|
<chr> | <dbl> |
automatic | 11182.669 |
manual | 9529.586 |
other | 18130.560 |
par(mfrow = c(1,2))
barplot(
table(df$transmission),
xlab = 'Transmission types',
ylab = 'Number of cars sold',
col = '#ff9933',
main = 'Number of cars sold per transmission type of cars'
)
barplot(
names.arg = transmission_types$transmission,
height = transmission_types$avg_price,
col = "#009999",
xlab = 'Transmission types',
ylab = 'Average price range',
main = 'Average price range of car sold per transmission type'
)
Conclusions :-
These conclusions lead to the argument that, transmission of a vehcile offers limited influence in predicting the price of a car.
The next column to analyze is the drive column. The drive column describes the drive wheel of the car. It could be forward wheel drive or rear wheel drive or all four wheel drive.
print(unique(df$drive))
grouped = df %>% group_by(drive) %>% summarise(avg_price = mean(price))
drive_types = grouped[1:3,]
drive_types
[1] NA "rwd" "fwd" "4wd"
drive | avg_price |
---|---|
<chr> | <dbl> |
4wd | 13886.081 |
fwd | 8879.667 |
rwd | 11890.166 |
par(mfrow = c(1,2))
barplot(
table(df$drive),
xlab = 'Wheel drive types',
ylab = 'Number of cars sold',
col = '#ff9933',
main = 'Number of cars sold per wheel drive type of cars'
)
barplot(
names.arg = drive_types$drive,
height = drive_types$avg_price,
col = "#009999",
xlab = 'Wheel drive types',
ylab = 'Average price range',
main = 'Average price range of car sold per wheel drive type'
)
The conclusions :-
The only columns left to analyze now are the Latitude and Longitude, identifying the location of car being sold online. Locations usually have effects on the prices of cars. Different states and regions have different tax brackets and different purchasing power thus bringing a change in prices for a car. The plot below maps all these locations on world map.
df = filter(df,!(is.na(df$lat) | is.na(df$long)))
# us = map_data('world')
# ggplot() + geom_polygon(data = us, aes(y = lat,x = long, group = group), fill='light grey', color='black') +
# geom_point(data = df, aes(x = long, y = lat), color = '#009999')
theme_set(theme_bw())
world = ne_countries(scale = 'medium', returnclass = 'sf')
ggplot(data = world) + geom_sf() + geom_point(data = df, aes(x=long, y=lat), color = '#009999') +
ggtitle("World Map", subtitle = 'Location of listed cars') +
xlab("Longitude") +
ylab("Latitude")
There are alot of outliers in these columns. Some of the locations are over the oceans which is obviously faulty. Since majority of the points lie in the United States of America, we will use the interquartile range
(IQR) method to remove the outliers. Once the outliers are removed, the locations are plotted against the United States of America map to identify the spread of data.
stats_lng = summary(df$long)
interquartile = stats_lng[5] - stats_lng[2]
thresh = interquartile * 1.5
upper = stats_lng[5] + thresh
lower = stats_lng[2] - thresh
df = filter(df,between(df$long,lower,upper))
stats_lt = summary(df$lat)
interquartile = stats_lt[5] - stats_lt[2]
thresh = interquartile * 1.5
upper = stats_lt[5] + thresh
lower = stats_lt[2] - thresh
df = filter(df,between(df$lat,lower,upper))
head(df,5)
price | year | manufacturer | fuel | odometer | title_status | transmission | drive | type | lat | long | |
---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | |
1 | 10299 | 2012 | acura | gas | 90186 | clean | automatic | NA | other | 35.7636 | -78.7443 |
2 | 9500 | 2011 | jaguar | gas | 85000 | clean | automatic | NA | NA | 36.1032 | -79.8794 |
3 | 3995 | 2004 | honda | gas | 212526 | clean | automatic | fwd | SUV | 35.7636 | -78.7443 |
4 | 12995 | 2015 | kia | gas | 85127 | clean | automatic | 4wd | wagon | 35.7636 | -78.7443 |
5 | 10995 | 2014 | chevrolet | gas | 112383 | clean | automatic | fwd | SUV | 35.7636 | -78.7443 |
us = map_data('state')
ggplot() + geom_polygon(data = us, aes(y = lat,x = long, group = group), fill='light grey', color='black') +
geom_point(data = df, aes(x = long, y = lat), color = '#009999') +
ggtitle("USA Map", subtitle = 'Location of listed cars') +
xlab("Longitude") +
ylab("Latitude")
We can clearly see the high density areas where cars are listed from on craigslist. Location as discussed is an important factor and hence the Latitude and Longitude values cannot be ignored. To get more meaning out of the data, we will apply clustering algorithm, K-means to find clusters region wise in the data.
set.seed(123)
spatial_points = df[,c('lat','long')]
wss <- function(k) {
kmeans(spatial_points, k, nstart = 10 )$tot.withinss
}
k.values <- 1:15
wss_values <- map_dbl(k.values, wss)
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
#fviz_nbclust(df, kmeans, method = "wss")
The Elbow Method gives us the optimal number of clusters to be chosen. The plot suggests that an optimal number of clusters would be 6.
mod = kmeans(spatial_points, 6, nstart = 25)
print(mod$centers)
lat long 1 41.90470 -87.28916 2 35.41935 -98.72366 3 35.79967 -117.86442 4 40.82153 -75.08626 5 32.64633 -83.32673 6 45.73295 -119.66174
Now that we have the clusters, we can view the clusters on the map as below. It shows us the area under each cluster formed.
fviz_cluster(mod,data = spatial_points[,c('long','lat')])
We can use these clusters as location information and see if it adds to the dataset and predictive modeling. Since the Latitude and Longitude values are geospatial values, they cannot exactly show a linear relationship, but hopefully these cluster values will.
df['region'] = mod$cluster
head(df,5)
price | year | manufacturer | fuel | odometer | title_status | transmission | drive | type | lat | long | region | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <int> | |
1 | 10299 | 2012 | acura | gas | 90186 | clean | automatic | NA | other | 35.7636 | -78.7443 | 5 |
2 | 9500 | 2011 | jaguar | gas | 85000 | clean | automatic | NA | NA | 36.1032 | -79.8794 | 5 |
3 | 3995 | 2004 | honda | gas | 212526 | clean | automatic | fwd | SUV | 35.7636 | -78.7443 | 5 |
4 | 12995 | 2015 | kia | gas | 85127 | clean | automatic | 4wd | wagon | 35.7636 | -78.7443 | 5 |
5 | 10995 | 2014 | chevrolet | gas | 112383 | clean | automatic | fwd | SUV | 35.7636 | -78.7443 | 5 |
Now that we have all the independent variables ready, we will build the model.
rm_cols = c(
'manufacturer'
)
df = df[!names(df) %in% rm_cols]
df = na.omit(df)
df = filter(df,df$price != 0)
head(df,5)
price | year | fuel | odometer | title_status | transmission | drive | type | lat | long | region | |
---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <int> | |
1 | 3995 | 2004 | gas | 212526 | clean | automatic | fwd | SUV | 35.7636 | -78.7443 | 5 |
2 | 12995 | 2015 | gas | 85127 | clean | automatic | 4wd | wagon | 35.7636 | -78.7443 | 5 |
3 | 10995 | 2014 | gas | 112383 | clean | automatic | fwd | SUV | 35.7636 | -78.7443 | 5 |
4 | 3995 | 2007 | gas | 185123 | clean | automatic | fwd | convertible | 35.7636 | -78.7443 | 5 |
5 | 7995 | 2008 | gas | 162214 | clean | automatic | fwd | pickup | 35.7636 | -78.7443 | 5 |
dummied_cols = c(
'fuel',
'title_status',
'type',
'drive',
'transmission'
)
df = fastDummies::dummy_cols(df,remove_first_dummy = FALSE,ignore_na=TRUE)
df = df[,!names(df) %in% dummied_cols]
head(df,5)
price | year | odometer | lat | long | region | fuel_gas | fuel_diesel | fuel_hybrid | fuel_other | ⋯ | type_pickup | type_truck | type_coupe | type_sedan | type_offroad | type_van | type_other | type_hatchback | type_mini-van | type_bus | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <int> | <int> | <int> | <int> | ⋯ | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | |
1 | 3995 | 2004 | 212526 | 35.7636 | -78.7443 | 5 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 12995 | 2015 | 85127 | 35.7636 | -78.7443 | 5 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 10995 | 2014 | 112383 | 35.7636 | -78.7443 | 5 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 3995 | 2007 | 185123 | 35.7636 | -78.7443 | 5 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 7995 | 2008 | 162214 | 35.7636 | -78.7443 | 5 | 1 | 0 | 0 | 0 | ⋯ | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
set.seed(9)
index = sort(sample(nrow(df),nrow(df)*0.8))
train_set = df[index,]
test_set = df[-index,]
train_set[,2:26] = as.data.frame(scale(train_set[,2:26]))
test_set[,2:26] = as.data.frame(scale(test_set[,2:26]))
head(train_set,5)
model = lm(price ~ ., data = train_set)
summary(model)
Call: lm(formula = price ~ ., data = train_set) Residuals: Min 1Q Median 3Q Max -32985 -2734 -332 2479 38380 Coefficients: (5 not defined because of singularities) Estimate Std. Error t value (Intercept) -1264841.3001667 5711.5390444 -221.454 year 634.4522830 2.7974269 226.799 odometer -0.0459816 0.0002768 -166.105 lat -34.8503118 2.5737276 -13.541 long -48.6059563 0.7943343 -61.191 region 112.2425414 7.5082069 14.949 fuel_gas 203.5568339 275.5666466 0.739 fuel_diesel 6221.1314086 282.0288637 22.058 fuel_hybrid 1991.0446310 296.7825989 6.709 fuel_other 297.0874199 286.8522008 1.036 fuel_electric NA NA NA title_status_clean 3759.5147034 892.8405831 4.211 title_status_salvage 184.8517590 900.1574613 0.205 title_status_rebuilt 725.1036629 895.7112320 0.810 title_status_lien 4832.5565870 902.9966119 5.352 title_status_missing 2390.6650680 1057.4232495 2.261 `title_status_parts only` NA NA NA transmission_automatic -900.9023882 54.9570059 -16.393 transmission_other 324.4152141 78.1313343 4.152 transmission_manual NA NA NA drive_fwd -3135.4859589 40.2641785 -77.873 drive_4wd 1319.6082815 42.0174297 31.406 drive_rwd NA NA NA type_SUV -513.2400414 410.9645936 -1.249 type_wagon -1781.8930315 416.2457747 -4.281 type_convertible 1657.9554187 417.3907380 3.972 type_pickup 3396.0559215 411.1225620 8.260 type_truck 3070.6543166 410.9313266 7.472 type_coupe 787.0943786 412.9814787 1.906 type_sedan -2064.1511306 410.6521268 -5.027 type_offroad 2861.6222654 482.6209026 5.929 type_van 786.0750608 416.3298094 1.888 type_other 35.5446382 415.2656646 0.086 type_hatchback -2765.4516380 414.9861533 -6.664 `type_mini-van` 281.4961729 419.5468457 0.671 type_bus NA NA NA Pr(>|t|) (Intercept) < 0.0000000000000002 *** year < 0.0000000000000002 *** odometer < 0.0000000000000002 *** lat < 0.0000000000000002 *** long < 0.0000000000000002 *** region < 0.0000000000000002 *** fuel_gas 0.4601 fuel_diesel < 0.0000000000000002 *** fuel_hybrid 0.0000000000196974 *** fuel_other 0.3004 fuel_electric NA title_status_clean 0.0000254690350053 *** title_status_salvage 0.8373 title_status_rebuilt 0.4182 title_status_lien 0.0000000872651611 *** title_status_missing 0.0238 * `title_status_parts only` NA transmission_automatic < 0.0000000000000002 *** transmission_other 0.0000329508198372 *** transmission_manual NA drive_fwd < 0.0000000000000002 *** drive_4wd < 0.0000000000000002 *** drive_rwd NA type_SUV 0.2117 type_wagon 0.0000186281928059 *** type_convertible 0.0000712481459667 *** type_pickup < 0.0000000000000002 *** type_truck 0.0000000000000792 *** type_coupe 0.0567 . type_sedan 0.0000005000355129 *** type_offroad 0.0000000030482392 *** type_van 0.0590 . type_other 0.9318 type_hatchback 0.0000000000267464 *** `type_mini-van` 0.5023 type_bus NA --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4807 on 149475 degrees of freedom Multiple R-squared: 0.6484, Adjusted R-squared: 0.6484 F-statistic: 9190 on 30 and 149475 DF, p-value: < 0.00000000000000022
pred_train = predict(model,train_set[!names(test_set) %in% c('price')])
pred_test = predict(model,test_set[!names(train_set) %in% c('price')])
cat("Train RMSE: ",RMSE(pred_train,train_set$price),"\n")
cat("Train R-squared: ",R2(pred_train,train_set$price),"\n")
cat("Test RMSE: ",RMSE(pred_test,test_set$price),"\n")
cat("Test R-squared: ",R2(pred_test,test_set$price))
Train RMSE: 4806.349 Train R-squared: 0.6484367 Test RMSE: 4791.563 Test R-squared: 0.64748
The Linear regression plot shows a good fit with the adjusted R-squared value to be at 64.84% on the train set. It is a reasonable fit.
predictions = predict(model,test_set[!names(test_set) %in% c('price')])
residual = predictions - test_set$price
ggplot() +
geom_point(aes(x= predictions, y= residual), color='#009999') +
geom_abline(slope = 0, intercept = 0, color='#ff9933', lwd = 2) +
xlab('predicted price') +
ylab('residual') +
ggtitle('Residual plot')
res_data = as.data.frame(residual)
m = mean(res_data$residual)
s = sd(res_data$residual)
ggplot(data = res_data, aes(x = residual)) +
geom_histogram(aes(y = ..density..), fill='#009999', bins=25) +
stat_function(fun = dnorm, args = list(mean= m,sd=s), color= '#ff9933', size=2)
A residual plot shows the variance of error the plot makes in the predictions. The variance seems well distributed at the 0 point. This shows that the residuals are homoskedastic. Which is a favourable condition.
In conclusion we can say, the following characteristics are good predictors of the price of a car:-
With these predictors, a good fit is seen with normally distributed residual error and a fair RMSE value.