Struggles with a GCP Cloud Functions Stacktrace
I decided to explore restructuring some of our ETL processes that feed data to BigQuery so that they stop saving data files locally. Currently, we pull data from multiple sources, apply some transformations, save local copies of the transformed data, and push that same data to BigQuery. Those locally saved files essentially make it possible for our on-premise data silos to tap into the transformed data.
However, as you may have noticed from my about me page, I don’t work for Google or Amazon. We don’t have hardware with tons of storage just waiting for me to fill it up. I have no VM with more than 50 GB of storage dedicated to a non-database file system; only the database file systems get to go past that hard limit. In many ways, this makes sense, since compression tools and log aggregators have gotten very good over the years. So, I don’t want to keep data files lying around on machines with somewhat limited storage once I have processed them. But I still have to process tons of data every day and push it to BigQuery. Of course, the answer had to lie somewhere between Storage and BigQuery. And, you guessed it right: Cloud Functions.
These suckers are so good that they will let you trigger some cool operations just because a file got dropped off in a given Storage bucket. The cool kids of a certain “Web Services” persuasion will say: “but this is exactly what lambdas do!” And then, I say: “cooool!”
So, how do we go from local files to stacktraces on some public cloud? If you’re already bored reading this, the answer is:

- write some buggy/crashing code,
- then `gcloud functions deploy`,
- and finally `gsutil cp`.
But, for those of us with a modicum of patience, let’s find out in a much more elegant manner.
Deploying a function
The goal here is to write a few lines of Python code to import data from Storage to BigQuery. Lucky for us, this already comes in a very well maintained package unsurprisingly called google-cloud-bigquery. Using that package, we get a valid path to our data file in Storage, create a connection to BigQuery, and load our data. Evidently, we need to make sure the destination datasets and tables exist in BigQuery. The following does exactly that:
| """ | |
| Load data from Storage to BigQuery | |
| """ | |
| import os | |
| from google.cloud import bigquery | |
| from google.cloud.exceptions import ( | |
| Conflict, NotFound | |
| ) | |
| import utils | |
| def hello_gcs(event, context): | |
| """Triggered by a change to a Cloud Storage bucket. | |
| Args: | |
| event (dict): Event payload. | |
| context (google.cloud.functions.Context): Metadata for the event. | |
| """ | |
| project = os.getenv('PROJECT_ID', '') | |
| gs_path = os.path.join( | |
| event['bucket'], | |
| event['name'] | |
| ) | |
| bfname = os.path.basename(event['name']) | |
| ds, tbl, *_ = bfname.rsplit('_', 2) | |
| course_id = utils.ds_to_course(ds) | |
| client = bigquery.Client(project) | |
| dref = bigquery.Dataset(f"{project}.{ds}") | |
| dref.description = utils.DS_DESC.format(id=course_id) | |
| schema = [bigquery.SchemaField(*f) for f in utils.SCHEMA['fields']] | |
| load_config = utils.make_load_config(schema) | |
| tref = dref.table(tbl) | |
| tref.description = utils.TBL_DESC.format(id=course_id, t=tbl) | |
| try: | |
| client.get_dataset(dref) | |
| except NotFound: | |
| try: | |
| client.create_dataset(dref) | |
| except Conflict: | |
| pass | |
| try: | |
| client.get_table(tref) | |
| except NotFound: | |
| try: | |
| client.create_table(tref) | |
| except Conflict: | |
| pass | |
| job = client.load_table_from_uri( | |
| source_uris='gs://{p}'.format(p=gs_path), | |
| destination=tref, | |
| job_config=load_config | |
| ) | |
| utils.wait_for_job(job) | |
| return True |
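I won’t paste the whole `utils.py` here, but a rough sketch of what it could look like, reconstructed only from the names the function above actually uses (`ds_to_course`, `make_load_config`, `wait_for_job`, `SCHEMA`, `DS_DESC`, `TBL_DESC`), goes something like this. The bodies, schema fields and naming convention below are placeholders, not the real thing:

```python
"""
utils.py -- a hypothetical sketch, not the real module.
Only the names come from the function above; everything else is assumed.
"""
from google.cloud import bigquery

# Assumed description templates and schema; the real values are not shown here.
DS_DESC = "Dataset holding tracking logs for course {id}"
TBL_DESC = "Table {t} for course {id}"
SCHEMA = {
    'fields': [
        ('event_time', 'TIMESTAMP', 'NULLABLE'),
        ('payload', 'STRING', 'NULLABLE'),
    ]
}


def ds_to_course(ds):
    """Map a dataset name back to a course identifier (assumed convention)."""
    return ds.replace('__', '/').replace('_', '.')


def make_load_config(schema):
    """Build a LoadJobConfig for newline-delimited JSON files (assumed format)."""
    config = bigquery.LoadJobConfig()
    config.schema = schema
    config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
    return config


def wait_for_job(job):
    """Block until the load job finishes and raise if it did not succeed."""
    job.result()  # waits for completion and raises on most job errors
    if job.errors:
        raise RuntimeError(job.errors)
```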
The real `utils.py` module is a file with some useful constants and functions defined to help with dataset and table names in BigQuery, as well as to wait for the load job to finish. The deployment shell script looks like the following:
```bash
#!/bin/bash
# Define constants and a function
GS_BUCKET="our-so-very-unique-bucket"
FUNC_NAME="storage-2-bq-loader"
EVENT="google.storage.object.finalize"

function leave() {
    echo "${1} failed"
    exit 1
}

# Change into the directory with the source code
cd /path/to/function/directory

# Start the deployment
echo `date` "Deploying the data loader GCP Function"
gcloud functions deploy --runtime python37 --env-vars-file env.yaml \
    --trigger-bucket ${GS_BUCKET} --trigger-event ${EVENT} \
    ${FUNC_NAME} || leave gcloud

# If you got here, then you should be good to go
echo `date` "Function ${FUNC_NAME} deployed"
exit
```
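With the function deployed, the `gsutil cp` part of the TL;DR earlier is all it takes to trigger it: copying any file into the bucket fires the `google.storage.object.finalize` event. The file name below is made up; the only thing that matters is that it follows the `<dataset>_<table>_<suffix>` pattern the function expects:

```bash
# Made-up file name; the function parses the dataset and table out of it
gsutil cp course101_tracking_20200518.json gs://our-so-very-unique-bucket/
```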
The main issues
Looking at that Python script, there are quite a few places where something could go wrong:
- The call to `os.getenv('PROJECT_ID', '')` makes the wild assumption that the `PROJECT_ID` environment variable is actually set. It was provided to `gcloud functions deploy` via the `--env-vars-file` option, but it may very well be possible that it does not actually propagate.
- The subsequent `os.path.join` call assumes that GCP will run our function on a unix machine, since we are counting on `/` being the path separator.
- The `rsplit('_', 2)` unpacking makes the wild assumption that every file dropped off in our bucket will always have a name with at least one underscore, meaning that we can always get the values `ds` and `tbl` without hitting a `ValueError` complaining that there are not enough values to unpack (see the small demonstration after this list).
- There is also the highly unlikely scenario where, between the time the function is triggered and the time it reaches the `load_table_from_uri` call, some crazy fast person goes ahead and deletes the copied file or even the bucket itself. As a result, we would end up trying to load a file that is no longer there.
- Last, but far from being least, the `wait_for_job` function throws a `RuntimeError` if the load job is unsuccessful.
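To make the unpacking point concrete, here is a toy example (not part of the function) of what happens with a file name that has no underscore at all:

```python
# A name with no underscore leaves only one value to unpack into ds, tbl, *_
bfname = "nounderscores.json"
ds, tbl, *_ = bfname.rsplit('_', 2)
# ValueError: not enough values to unpack (expected at least 2, got 1)
```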
Any one of these scenarios could cause the script to throw an error. So, it is vital that we be able to see some kind of logs when the script raises an exception. You will be happy to know that the logs do exist and can be fetched; I mean, this is Google after all. In any case, when your script runs, you can inspect most of the available logs from within your GCP console or with the `gcloud` CLI in your terminal.
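For instance, pulling the latest entries for this function from the terminal looks something like this (the limit is just an illustrative value):

```bash
# Read the most recent log entries emitted by the function
gcloud functions logs read storage-2-bq-loader --limit 50
```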
The actual struggles
Much of the struggle stemmed from the `source_uris` argument in that Python script. I accidentally forgot to prepend it with the `gs://` protocol prefix, which is essential in helping the client library get to the actual file that needs to be loaded into BigQuery. Luckily, our `wait_for_job` should be able to catch this and throw a `RuntimeError` exception. However, according to the logs, the following happened to my function:
```
storage-2-bq-loader some-exec-id 2020-05-18 17:28:19.625 execution took 2121 ms, finished with status: 'crash'
```

That’s about all I could get. In hindsight, I know to remember to add the `gs://` protocol indicator, but I still want my stacktrace! As far as the documentation on Cloud Functions error reporting is concerned, the `wait_for_job` function should have dumped something to the standard error of my process, and that something should have been reported in the function’s logs. It never made it there.
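For the record, the difference between what I deployed and what the listing above shows boils down to this (with a made-up object path in the same bucket):

```python
gs_path = "our-so-very-unique-bucket/course101_tracking_20200518.json"

# What I had deployed: a bare "bucket/object" path, which the BigQuery load job rejects
bad_uri = gs_path

# What it should have been: a full Cloud Storage URI with the gs:// prefix
good_uri = 'gs://{p}'.format(p=gs_path)
```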
Those are my struggles.
A little bit of hope
There is a glimmer of hope, but it comes in the form of another operations-oriented product called Stackdriver (now part of the Operations suite). This, of course, entails that I have to tweak my function to report exceptions to Stackdriver rather than let them bubble up to the standard error of my process. In terms of source code changes, I would have to wrap the entirety of my function in a try-except that handles the base Exception, and only at that point report the exception to Stackdriver. Reporting exceptions to Stackdriver can be handled using another Google-maintained Python package called google-cloud-error-reporting. We would likely end up with something like the following:
```python
import os

from google.cloud import error_reporting


def main(event, context):
    """
    Our new entry point

    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    reporter = error_reporting.Client(os.getenv('PROJECT_ID', ''))
    try:
        hello_gcs(event, context)
    except Exception:
        reporter.report_exception()
```
Of course, this can still fail when trying to create our `reporter` value; one way to guard against that is sketched below.
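A minimal sketch of that guard, assuming I am content with falling back to dumping the traceback to standard error when the error-reporting client itself cannot be created:

```python
import os
import traceback

from google.cloud import error_reporting


def main(event, context):
    """Entry point that degrades gracefully if error reporting is unavailable."""
    try:
        reporter = error_reporting.Client(os.getenv('PROJECT_ID', ''))
    except Exception:
        reporter = None
    try:
        hello_gcs(event, context)  # the original function shown earlier
    except Exception:
        if reporter is not None:
            reporter.report_exception()
        else:
            # Last resort: at least dump the traceback to stderr
            traceback.print_exc()
```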
Conclusion
While I am not particularly satisfied with adding more code and redeploying my function, I think I can kind of live with this.
As with all struggles, it often takes an outsider to help us realize what we are doing wrong. So, feel free to share your wisdom below.