--- layout: post title: "tbl: abstractions and imports" author: mjadud commit: https://github.com/jadudm/pytbl/tree/527b16bdecbf73b874103922cf3038a1f2c1e1c7 tags: - cc-sa - tbl - blog - "2020" - 2020-03 date: 2020-03-08 --- There is a debate in the data science community (and, in particular, in the R community) as to whether one should learn libraries or a core language when working with data. For R programmers, it is a question of learning the dyplr-family of libraries vs. working directly in the language without those tools. This is, from what I can gather, a sometimes divisive argument. As an educator and a developer, I've come to appreciate the power of a good abstraction and tools that support that abstraction. I want tools that help me map the way I think about a problem directly into code. Or, I want tools that will shape the way I think about problems, so that I can more concisely express solutions using those tools. Here, "tools" means "libraries" or "programming languages." My approach to working on `tbl` has therefore been to think about how to make it easy for beginners to work with interesting data. "Interesting" might mean it is personally meaningful, and possibly a small amount of data. "Interesting" might mean it is large and complex data... but, important to the developer. This means `tbl` needs to support data that is both small and big, and it needs to be easy for a developer to get started. ## Imports: The `tbl` I want my beginner to be thinking about tabular data. So, I want a `tbl` to make it easy to turn a spreadsheet into something that they can do meaningful work with. In this way, the first abstraction that a programmer sees with `tbl` is the spreadsheet, and they can map that abstraction directly into the library. A `tbl` is, in the first instance, a spreadsheet. Blue lobster photo by David Clode on Unsplash. Here, for example, is the [spreadsheet](https://docs.google.com/spreadsheets/d/1aCjhAepc2Ms-eIr97hPb2FvzEKmqWK1w6mH8MnVUtRs/edit?usp=sharing) I use to keep track of my pet lobsters.
In the case that I have small, but interesting data, it would be nice if I could have a GUI for manipulating/entering that data, and could quickly pull it into a program that I'm writing without having to go through lots of hoops. **If I want a good GUI for manipulating tabular data, I should use a spreadsheet!** As it happens, not only can I use Google Sheets for this, but Sheets will let me publish my data to the web for embedding, and Sheets also makes it easy for me to pull the CSV directly. But, I don't want a programmer to know that there's a CSV file waiting for them... I just want them to be able to import the data. Something that might look like this: ```python import tbl pets_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSK2rd47ogfI2CpQi2L6HDo9AOEhnhqBN4zR4kLPUO28vBzmlc8XQWrvTfBYCU0ePf478yxcNKdOy5m/pub?gid=0&single=true&output=csv" pets_tbl = tbl_from_sheet(pets_url) ``` To test this out, I'll drop some code in `lobsters.py` and `tbl.py`. When I run `python lobsters.py` I get the following: ``` (venv) jadudm@lego:~/git/pytbl$ python lobsters.py ['Name', 'Weight', 'Color'] ['Bart', '0.75', 'Muddy'] ['Flick', '1', 'Muddy'] ['Bubbles', '1.2', 'Blue'] ['Crabby', '0.5', 'Muddy'] ``` Now, this doesn't get us all the way, but it takes the first step: I've created a data table in Google Sheets, and I can pull it in via the Requests library as a CSV document that is parseable and iterable. So far, so good. ## Abstractions The next step is to design the abstraction for a `tbl`. So that no one post gets too long, this will be the subject of tomorrow's explorations. The goal here will be to avoid creating an [abstraction that is overly leaky](https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/), to use Joel Spolsky's terminology. I'm going to want a way to work with this data that: * Can store the data locally or remotely * Can work with centralized and decentralized data * Can leverage multiple concrete representations, invisibly * Can operate on the data closer conceptually rather than syntactically * Can support programmers at multiple levels of experience and expertise These are going to be a complex set of requirements, and I'll miss the first time. (This is actually the *second time* I've explored this idea; I've already done a deep dive in the programming language Racket, so in truth, I've got some ideas in my back pocket.) For the exercise that this is, I'll probably do the following: * Explore SQLite for local and CockroachDB for remote/distributed data * Use an ORM (SQLalchemy, Peewee, or similar) to manage those relationships * Use R, Python (pandas), Pyret, and other data languages/frameworks as inspiration * Choose some authentic use-cases to drive the development (e.g. perhaps interface into some of my own research data to drive both the research and the development of `tbl` forward) ## Get the code! It's early days, but you can get the code. This work will be open (as all of my work is, whenever possible), at Github. I'll call the project [pytbl](https://github.com/jadudm/pytbl).