We are providing this technical insight into the project in the hope that it can help other organizations, whether they are building something of their own or taking our initial work even further. For more information than what is here, please get in touch with us!
The CAVI was built in a few stages: assembling a list of arts organizations from funders' open data, scraping video data from their YouTube channels, cleaning and preparing that data, and building a front-facing search interface on our website.
These stages are simple to describe in hindsight, but our process was fairly exploratory. As described elsewhere in this set of resource pages, we built this prototype as much to fulfill a specific function as to use making as a way to understand the audio-video infrastructure of our time. We approached this with a DIY/hacker spirit, working to pull things apart and put them back together in new, sometimes unexpected ways.
The most obvious place to build our dataset from was the open data published by funders. We focused on the Canada Council for the Arts and informally cross-referenced it with data from other provincial funders.
The Canada Council offers a wealth of open data (check here); however, much of it does not include websites or social media channels. We began using Airtable to filter the data, add new information, and explore the content.
Key Finding: If public funders released open datasets that included all available websites and social media channels, it would enable smaller arts organizations to do more with the information, faster. (It took us several days to manually find and add this information to the available open data.)
We compared content from Facebook, TikTok, Instagram, Vimeo and YouTube. While the content landscape is always shifting, our informal observations led us to conclude that YouTube represented the central publication platform for most of the organizations we looked at. The CAVI is far from comprehensive, but in our initial explorations it was clear (and somewhat of a surprise to us) that YouTube was a very dominant force in the online infrastructure for this material.
Artengine is happy to share the list of 516 arts organizations, with their corresponding websites and YouTube channels, from which we built the scraper. Please download the CSV file of the list here.
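To point a scraper at a list like this, the YouTube channel column can be turned into a set of start URLs. A minimal sketch in Python, assuming a column named `youtube_channel` (the actual header in the shared file may differ):

```python
import csv
import io

def youtube_start_urls(csv_text, column="youtube_channel"):
    """Turn the organization-list CSV into scraper start URLs.

    The column name is an assumption -- adjust it to match the
    headers in the shared CSV file. Appending "/videos" targets
    each channel's uploads tab. Rows with no channel are skipped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column].rstrip("/") + "/videos"
            for row in reader if row.get(column)]
```

The resulting URLs can be pasted into the sitemap's start-URL field in the Webscraper tool.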
The open data from public funders was both an opportunity and a limitation in this project. Artengine is very aware that this is only one version of Canadian arts content online. It is a partial picture, and one that intersects, for better or worse, with a large state organization and its national and institutional mandate. This institutional approach gave us the opportunity to build a large list for our prototype with a clear limitation of the data in mind. We did not make editorial or curatorial decisions, but followed the funding structure (including all core funders and many projects in various Fields of Practice – a key funding term in the Canada Council data tables).
Based on the questions guiding the inquiry, and taking into account the peers selected for the consultation, four main themes stood out as we consolidated the feedback and discussion notes.
Essentially, the consultation confirmed that the Index would be a very useful tool for programming and curatorial work. It would fast-track research to reveal Canadian art content and break through the corporate algorithms inherent to big streaming platforms. However, it was agreed that interface and structural improvements should be made to add value to how the tool searches and displays the data. Beyond this, there was mixed debate about its social impacts, the harm it may cause, and the safeguards needed to protect artists.
Lastly, the combined feedback makes clear that developing the CanCon AV Index into a fully workable and successful tool would require dedicated staff, operational processes, and substantial funding. Artengine would have to significantly change its mandate to proceed with that development.
An example of a ‘sitemap’ (the term used by Webscraper) can be found here in a text file. The example file has a set of YouTube channels loaded into it from one of our scrapes. You can edit the file to include whatever channels you like or do this from within the Webscraper tool.
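For orientation, a Webscraper sitemap is a JSON file. A minimal, illustrative sketch of the general shape follows; the channel URL and CSS selectors here are placeholders for this example, not the ones from our actual scrape:

```json
{
  "_id": "cavi-youtube-example",
  "startUrl": ["https://www.youtube.com/@ExampleChannel/videos"],
  "selectors": [
    {
      "id": "video-link",
      "type": "SelectorLink",
      "parentSelectors": ["_root"],
      "selector": "a#video-title-link",
      "multiple": true
    },
    {
      "id": "description",
      "type": "SelectorText",
      "parentSelectors": ["video-link"],
      "selector": "#description",
      "multiple": false
    }
  ]
}
```

Adding more channels is a matter of appending their URLs to the `startUrl` array, either by editing the file or from within the Webscraper tool.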
To use the sitemap:
In total, we extracted data for more than 33,000 YouTube videos from over 500 arts organizations across Canada. We are happy to share the CSV file with all the data compiled and cleaned for use. Please get in touch with us at cancon@artengine.ca.
Vast amounts of data like this are difficult to manage. For instance, transcript texts are often extremely long and cannot be properly stored and viewed in a single cell in spreadsheet software like Excel or LibreOffice. Even though we collected a lot of data, we had difficulty handling it for clean-up and preparation with the software available to us.
We decided to use Airtable again in this second stage, as it met most of our needs and was very user friendly. We needed one of the highest paid tiers to accommodate the volume of data we output, but it made most of the clean-up very easy. The data needed to be cleaned up because the extraction process is not an exact science: YouTube's page layout includes text and links that appear as redundant entries in the compiled scrape file. We used Airtable's search-and-replace extensions to clean up the data. We tried our best to make the data as presentable as we could, but there were limitations. One limitation that was particularly challenging was the long text field, which has a 100,000-character limit, meaning many of the longest automated transcripts could not be accommodated in Airtable. In the end, we used another system on our website to get those longer transcripts into the CAVI.
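Checks like this are easy to automate before importing anything. A small Python sketch that flags the rows whose transcripts would be truncated; the column name, and whether 100,000 characters is the cap on your Airtable plan, are assumptions to verify:

```python
import csv
import io

AIRTABLE_LONG_TEXT_LIMIT = 100_000  # assumed character cap of a long-text field

def flag_long_transcripts(csv_text, column="transcript",
                          limit=AIRTABLE_LONG_TEXT_LIMIT):
    """Return the indices of rows whose transcript would be truncated
    by Airtable's long-text field. The column name is an assumption --
    adjust it to match the scrape file's headers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [i for i, row in enumerate(reader)
            if len(row.get(column) or "") > limit]
```

Rows flagged this way are the ones that later needed a separate path into the website, directly from the original scraping files.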
The last stage of prototyping was preparing a front-facing search function to access and play with the data scraped from YouTube. We decided that the best way to display and access the data would be to build a search engine on Artengine's WordPress website.
To build the Index interface into our website we used the following plugin tools: Custom Post Type and Advanced Custom Fields.
These two plug-ins provide the specific database categories for the CAVI. Custom Post Type allows us to create a specific Post type that the Advanced Custom Fields are linked to. We mimicked the structure of the scraped data, enabling us to import and create new instances of the 33,000+ items in our own WordPress site database.
Posts and Pages are two key aspects of the WordPress structure, and a Custom Post type is essentially a key in the database driving the WordPress site. It allows us to use the data in that Post type in a number of different ways across the site. Here is more info on Posts and Pages from WordPress.
The WP All Import plugin allows us to import and export the large CSV file created from the data scraping. We used a paid tier, which allowed us to map the CSV file onto the custom post type we created using Advanced Custom Fields.
We built, deleted, and rebuilt this part of the project a number of times in the process of testing and modifying our data. The plugin was flexible enough to add to and update the data we had already imported. However, after working with Airtable for the clean-up, we deleted the database we had been working with on our site and rebuilt it with this plugin.
WP All Import also provided the solution to the character limitations from Airtable. Once we had imported all the data that Airtable was able to handle, we then used our original scraping files to add any items with very long transcripts into our site. With this combination of plug-ins our site provided the most complete version of the scraping data.
We used Elementor Pro as a powerful WYSIWYG web editor. This allowed us to create custom web templates that could be applied across the 33,000+ entries in the CAVI Post type. It also allowed us to create custom individual pages (such as the Search Landing page or this page), as well as pages that draw in and represent elements of the Custom Posts (i.e. the Search Results page).
For the search functionality we used the Relevanssi plug-in. It has proved robust and thorough, building an index of the specific fields we indicated within the CAVI data: Title, Organization, Description, and Transcript. The plugin gives you some ability to use quotation marks ("") to get results for exact-phrase searches. It also has a fairly good synonym index and handles basic misspellings to increase accuracy.
Completeness and usability were not central to our process; instead, we focused on prototyping and hacking as drivers of the design. We think of hacking, inspired by the DIY and maker movements, as a mode of intervention that breaks apart and appropriates existing infrastructures and tools, with an ethos of openness, sharing, and decentralization, as a way to reflect on, challenge, and reimagine new ways of being in the world. It was important to this project that open-source and easily accessible tools were used, documented, and shared. That is why, although we had the option to hire a developer and build all the components from scratch, it was more valuable in our approach that the tools and knowledge were accessible to those of us with some technical expertise, and that we could share them with those who may not have an IT background. By hacking the tool together ourselves, we reclaimed a sense of agency over the digital, understanding what limits could be pushed with our own limited set of technical skills.