import datetime
from datetime import date
today = date.today()
print(today)
type(today)
2024-01-18
datetime.date
glue and purrr. This post demonstrates a few miscellaneous tools for date manipulation, string interpolation (f-strings), and iteration (list comprehensions).
Emily Riederer
January 20, 2024
In the past few weeks, I’ve been writing about a stack of tools and specific packages like polars that may help R users feel “at home” when working in python due to similar ergonomics. However, one common snag in switching languages is ramping up on common “recipes” for higher-level workflows (e.g. how to build a sklearn modeling pipeline) but missing a language’s fundamentals that make writing glue code feel smooth (and dare I say pleasant?). It’s a maddening feeling to get code for a complex task to finish only to have the result wrapped in an object that you can’t suss out how to save or manipulate.
This post goes back to the basics. We’ll briefly reflect on a few aspects of usability that have led to the success of many workflow packages in R. Then, I’ll demonstrate a grab bag of coding patterns in python that make it feel more elegant to connect bits of code into a coherent workflow.
We’ll look at the kind of functionality that you didn’t know to miss until it was gone, that you may not be quite sure how to search for, and that leaves you wondering whether it’s even reasonable to hope there’s an analog.[1] This won’t be anything groundbreaking, just some nuts and bolts. Specifically: helper functions for date and time manipulation, advanced string interpolation, list comprehensions for more functional programming, and object serialization.
R’s passionate user and developer community has invested a lot in building tools that smooth over rough edges and provide slick, concise APIs to rote tasks. Specifically, a number of packages are devoted to:
- fs for navigating file systems or lubridate for more semantic date wrangling
- cli and glue to improve human readability of terminal output and string interpolation
- purrr which provides a concise, typesafe interface for iteration
All of these capabilities are things we could somewhat trivially write ourselves, but we don’t want to and we don’t need to. Fortunately, we don’t need to in python either.
I don’t know a data person who loves dates. In the R world, many enjoy lubridate’s wide range of helper functions for cleaning, formatting, and computing on dates.
Python’s datetime module is similarly effective. We can easily create and manage dates in the date or datetime classes, which make them easy to work with.
2024-01-18
datetime.date
Two of the most important functions are strftime() and strptime().
strftime() formats dates into strings. It accepts both a date and the desired string format. Below, we demonstrate by committing the cardinal sin of writing a date in a non-ISO 8601 format.
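The call that produced the output below is not reproduced here, so the following is only a minimal sketch of what it could look like. It assumes a fixed date (January 18, 2024) rather than today's date so that the result is reproducible, and the variable names are illustrative.

```python
from datetime import date

# Fixed date for reproducibility (an assumption; the post uses today's date)
today = date(2024, 1, 18)

# strftime() takes the desired output format as a string of % codes
today_str = today.strftime('%m/%d/%Y')
print(today_str)
type(today_str)
```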
01/18/2024
str
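strptime() does the opposite and turns a string encoding a date into an actual date. It does not guess the format: datetime.datetime.strptime() requires a format string describing how the date is written, so we provide that guidance explicitly. Note that the result is a datetime (with a time component), not a date.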
someday_dtm = datetime.datetime.strptime('2023-01-01', '%Y-%m-%d')
print(someday_dtm)
type(someday_dtm)
2023-01-01 00:00:00
datetime.datetime
Date math is also relatively easy with datetime. For example, we can calculate the difference between two dates simply by… taking the difference! From the resulting delta object, we can access the days attribute.
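As a small sketch of that idea, assuming two illustrative dates that mirror the ones used elsewhere in this post (the variable names are my own):

```python
from datetime import date

# Illustrative dates (assumptions chosen to mirror the post's examples)
today = date(2024, 1, 18)
someday = date(2023, 1, 1)

# Subtracting one date from another returns a datetime.timedelta
n_days = today - someday
print(type(n_days))
print(n_days.days)  # 382
```

One caveat: both sides of the subtraction need the same type. A date minus a datetime raises a TypeError, so a datetime such as the result of strptime() would first need its .date() method called.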
R’s glue is beloved for its ability to easily combine variables and text into complex strings without a lot of ugly, nested paste() calls.
Python has a number of ways of doing this, but the most readable is the newest: f-strings. Simply put an f before the string and put any variable names to be interpolated in {curly braces}.
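A minimal sketch, with made-up variable names and values purely for illustration:

```python
# Hypothetical values for illustration
language = "python"
n_tools = 3

# The f prefix tells python to fill in the names inside curly braces
msg = f"This post covers {n_tools} tools in {language}"
print(msg)  # This post covers 3 tools in python
```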
f-strings also support formatting with formats specified after a colon. Below, we format a long float to round to 2 digits.
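For instance, a sketch with a hypothetical proportion value (not taken from the original post):

```python
# Hypothetical long float for illustration
proportion = 2 / 3

# Everything after the colon is a format spec; .2f means 2 decimal places
msg = f"The proportion is {proportion:.2f}"
print(msg)  # The proportion is 0.67
```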
Any python expression – not just a single variable – can go in curly braces. So, we can instead format that proportion as a percent.
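A sketch of that pattern, again with a hypothetical proportion value:

```python
# Hypothetical value for illustration
proportion = 2 / 3

# The expression is evaluated first, then the format spec is applied
msg = f"The percent is {proportion * 100:.1f}%"
print(msg)  # The percent is 66.7%
```

The format spec also has a dedicated percent type, so f"{proportion:.1%}" yields the same 66.7% without the manual multiplication.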
Despite the slickness of f-strings, sometimes other string interpolation approaches can be useful. For example, if all the variables I want to interpolate are in a dictionary (as often will happen, for example, with REST API responses), the string format() method is a nice alternative. It allows us to pass in the dictionary, “unpacking” the argument with **.[2]
result = {
    'dog_name': 'Squeak',
    'dog_type': 'Chihuahua'
}
print("{dog_name} is a {dog_type}".format(**result))
Squeak is a Chihuahua
Combining what we’ve discussed about datetime and f-strings, here’s a pattern I use frequently. If I am logging results from a run of some script, I might save the results in a file suffixed with the run timestamp. We can generate this easily.
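One way this could look is sketched below. The file name and timestamp format are hypothetical choices, and a fixed timestamp stands in for datetime.datetime.now() so the result is reproducible.

```python
import datetime

# Fixed timestamp for reproducibility; a real script would use datetime.datetime.now()
run_dtm = datetime.datetime(2024, 1, 18, 9, 30, 5)

# Compact, sortable timestamp with no characters that are awkward in file names
run_stamp = run_dtm.strftime('%Y%m%d-%H%M%S')

# Hypothetical output file name
out_path = f"output-{run_stamp}.txt"
print(out_path)  # output-20240118-093005.txt
```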
Thanks in part to a modern-day fiction that for loops in R are inefficient, R users have gravitated towards concise mapping functions for iteration. These can include the *apply() family,[3] purrr’s map_*() functions, or the parallelized version of either.
Python too has a nice pattern for arbitrary iteration in list comprehensions. For any iterable, we can use a list comprehension to make a list of outputs by processing a list of inputs, with optional conditional and default expressions.
Here are some trivial examples:
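The snippet for these examples is not reproduced here, so what follows is a sketch under my own assumptions about the input list and variable names. The last expression is chosen so that it matches the output printed below.

```python
nums = [1, 2, 3]

# Transform every element
plus_one = [i + 1 for i in nums]                          # [2, 3, 4]

# Filter with a trailing if, then transform what remains
odds_plus_one = [i + 1 for i in nums if i % 2 == 1]       # [2, 4]

# Transform conditionally, with a default (else) for other elements
mixed = [i + 1 if i % 2 == 1 else i for i in nums]
print(mixed)
```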
[2, 2, 4]
There are also closer analogs to purrr like python’s map() function. map() takes a function and an iterable object and applies the function to each element. Like with purrr, functions can be anonymous (as defined in python with lambda functions) or named. List comprehensions are popular for their concise syntax, but there are many different thoughts on the matter as expressed in this StackOverflow post.
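A sketch of both styles, assuming a simple add-one function and input list of my own choosing:

```python
def add_one(i):
    """Named function to apply to each element."""
    return i + 1

nums = [1, 2, 3]

# map() returns a lazy iterator, so wrap it in list() to see the results
with_named = list(map(add_one, nums))

# The same operation with an anonymous (lambda) function
with_lambda = list(map(lambda i: i + 1, nums))
print(with_lambda)
```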
[2, 3, 4]
As a more realistic example, let’s consider how list comprehensions might help us conduct a numerical simulation or sensitivity analysis.
Suppose we want to simulate 100 draws from a Bernoulli distribution with different success probabilities and see how close our empirically calculated rate is to the true rate.
We can define the probabilities we want to simulate in a list and use a list comprehension to create the simulations. We then have a list-of-lists of results.
import numpy.random as rnd
probs = [0.1, 0.25, 0.5, 0.75, 0.9]
coin_flips = [ rnd.binomial(1, p, 100).tolist() for p in probs ]
len(coin_flips)
5
To conduct our analysis, we can put these into a polars dataframe.
prob | flip |
---|---|
f64 | list[i64] |
0.1 | [0, 0, … 0] |
0.25 | [0, 0, … 0] |
0.5 | [1, 1, … 0] |
0.75 | [1, 1, … 1] |
0.9 | [1, 1, … 1] |
To analyze our data, we can then “blow up” our list-of-lists (going from a 5-row dataset to a 500-row dataset) and aggregate the results.
Sometimes, it can be useful to save objects as they existed in RAM in an active programming environment. R users may have experienced this if they’ve used .rds, .rda, or .Rdata files to save individual variables or their entire environment. These objects can often be faster to reload than plaintext and can better preserve information that may be lost in other formats (e.g. storing a dataframe in a way that preserves its datatypes versus writing to a CSV file,[4] or storing a complex object that can’t be easily reduced to plaintext, like a model with training data, hyperparameters, learned tree splits or weights or whatnot for future predictions). This is called object serialization.[5]
Python has comparable capabilities in the pickle module. There aren’t really style points here, so I’ve not much to add beyond “this exists” and “read the documentation”. But, at a high level, it looks something like this:
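A minimal sketch of a save-and-reload round trip. The object and the file name are hypothetical, and a temporary directory is used so the example does not leave files behind in a working directory.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical object standing in for a fitted model or other complex result
my_object = {"params": {"depth": 3}, "scores": [0.91, 0.88]}

# Hypothetical file location inside a fresh temporary directory
path = Path(tempfile.mkdtemp()) / "my_object.pickle"

# Serialize: pickle writes bytes, so open the file in binary mode
with open(path, "wb") as handle:
    pickle.dump(my_object, handle)

# Deserialize: read it back (only unpickle files from sources you trust)
with open(path, "rb") as handle:
    my_object_reloaded = pickle.load(handle)

print(my_object_reloaded == my_object)  # True
```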
[1] I defined this odd scope to help limit the infinite number of workflow topics that could be included like “how to write a function” or “how to source code from another script”
[2] This is called “**kwargs”. You can read more about it here.
[3] Speaking of non-ergonomic things in R, the *apply() family is notoriously diverse in its number and order of arguments
[4] And yes, you can and should use Parquet and then my example falls apart – but that’s not the point!
[5] And, if you want to go incredibly deep here, check out this awesome post by Danielle Navarro.