1. Discovery
Where to find my data?
One good publicly available data source for my project is the UN Comtrade Database: http://comtrade.un.org/data/doc/api/
Using this API, I can query trade data for specified commodities; in this case I am interested in Petroleum and Oil Products (cc=27). When I requested data for all years and all countries at once, the result was too large to be returned. So I query the data for all countries one year at a time; for example, for 2011 the query looks like this:
Data can be queried in JSON or CSV format. I chose JSON. I collected 10 years of data, covering 2004-2014.
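The year-by-year splitting could be sketched roughly as below. This is only an illustration and not the actual Comtrade query: the base URL is a placeholder, and every parameter name except cc (which comes from the text above) is an assumption that should be checked against the API documentation.

```python
from urllib.parse import urlencode

# Placeholder endpoint, NOT the real Comtrade URL; take the real one from the API docs.
BASE_URL = "https://example.invalid/api/get"


def build_yearly_urls(first_year, last_year, commodity_code="27", fmt="json"):
    """Return a dict mapping each year to one query URL covering all countries."""
    urls = {}
    for year in range(first_year, last_year + 1):
        params = {
            "ps": year,            # assumed name of the period (year) parameter
            "r": "all",            # assumed name of the reporting-country parameter
            "cc": commodity_code,  # commodity code, 27 as mentioned in the text
            "fmt": fmt,            # assumed name of the output-format parameter
        }
        urls[year] = BASE_URL + "?" + urlencode(params)
    return urls


urls = build_yearly_urls(2004, 2014)
print(urls[2011])
```

Each URL can then be fetched and saved separately, so no single response has to hold all years at once.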
Sources for Trade flow
The UN Comtrade Database is detailed and of good quality. However, it doesn't include the trade flow, i.e. the flow from the exporting country to the importing country and the trade volume between them. I still need to find other sources for this information.
How does trade value relate to population and GDP per capita?
I collected GDP data from the UN database: http://data.un.org/Search.aspx?q=GDP+per+capita
and population data from the World Bank database: http://data.worldbank.org/indicator/SP.POP.TOTL
Do developed countries use more renewable energy?
To answer this, I also need data on renewable energy consumption.
2. Wrangling
When checking the data, I realised that sometimes data is available in one year but not in others, e.g. data for the UAE is available only prior to 2008. This will require wrangling the data so that only countries with comprehensive coverage are included.
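One way to keep only countries with comprehensive data is sketched below. The record layout (dicts with "country" and "year" keys) and the sample rows are made up for illustration; the real records would come from the downloaded JSON.

```python
def countries_with_full_coverage(records, years):
    """Return the sorted list of countries that have a record for every year in `years`."""
    required = set(years)
    seen = {}  # country -> set of years for which a record exists
    for rec in records:
        seen.setdefault(rec["country"], set()).add(rec["year"])
    return sorted(c for c, ys in seen.items() if required <= ys)


# Made-up sample rows: ARE is missing 2008, DEU is complete.
sample_records = [
    {"country": "ARE", "year": 2006},
    {"country": "ARE", "year": 2007},
    {"country": "DEU", "year": 2006},
    {"country": "DEU", "year": 2007},
    {"country": "DEU", "year": 2008},
]

complete = countries_with_full_coverage(sample_records, range(2006, 2009))
print(complete)
```

A softer variant would keep countries that miss only a year or two and interpolate the gaps, but that adds another assumption to track.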
Since I need data from several sources, data integration is necessary and will give rise to integration issues. One challenge that I can foresee is the mismatch between country encodings. For example, “CHN”, “People's Republic of China” and “P.R.C.” all refer to China. To resolve this issue, I am planning to keep a list of ISO 3166 codes and map each code to all of its possible names.
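A minimal sketch of such a mapping follows. The alias lists are small hand-written examples, not a complete table, and the normalisation rule (lowercase, letters and digits only) is my own assumption about what will be enough.

```python
# Hand-written example aliases keyed by ISO 3166-1 alpha-3 code (far from complete).
ALIASES = {
    "CHN": ["China", "People's Republic of China", "P.R.C."],
    "DEU": ["Germany", "Federal Republic of Germany"],
}


def normalise(name):
    """Lowercase the name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


# Reverse lookup: normalised alias (or the code itself) -> ISO code.
LOOKUP = {}
for code, names in ALIASES.items():
    LOOKUP[normalise(code)] = code
    for alias in names:
        LOOKUP[normalise(alias)] = code


def to_iso3(name):
    """Return the ISO code for a country name, or None if the name is unknown."""
    return LOOKUP.get(normalise(name))


print(to_iso3("P.R.C."), to_iso3("germany"), to_iso3("Atlantis"))
```

Returning None for unknown names makes it easy to list every unmatched name after a merge and extend the alias table by hand.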
Another possible problem across databases is the use of different units. In one database the trade figure may be a value in US Dollars, while in another it may be a volume, and the volume units themselves may not match (gallons, barrels or liters).
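For the volume units, converting everything to liters first is one option, sketched below. It assumes that "gallon" means the US gallon and "barrel" means the 42-gallon oil barrel; a source using imperial gallons would need different factors.

```python
# Liters per unit, assuming US gallons and 42-US-gallon oil barrels.
LITERS_PER_UNIT = {
    "liter": 1.0,
    "gallon": 3.785411784,
    "barrel": 158.987294928,
}


def to_liters(quantity, unit):
    """Convert a volume in the given unit to liters."""
    try:
        return quantity * LITERS_PER_UNIT[unit]
    except KeyError:
        raise ValueError("unknown volume unit: " + repr(unit))


print(to_liters(2, "barrel"))
print(to_liters(10, "gallon"))
```

Converting between dollars and volume is a different matter, since it needs a price for each year and cannot be done with a fixed factor.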
3. Profiling
Some assumptions might be made during this phase, for example, to disregard monetary inflation when representing trade value in US Dollars across different years, or to assume a constant inflation rate for simplicity.
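Under the constant-rate assumption, values from different years can be brought to a common base year as sketched below. The 2% rate and the dollar amounts are placeholders for illustration, not figures from the data.

```python
def to_base_year_dollars(nominal_value, year, base_year, annual_rate):
    """Deflate a nominal value to base-year dollars, assuming a constant annual inflation rate."""
    return nominal_value / (1.0 + annual_rate) ** (year - base_year)


# Placeholder numbers: 102 dollars in 2005 at 2% inflation is about 100 dollars of 2004.
print(to_base_year_dollars(102.0, 2005, 2004, 0.02))
```

Setting annual_rate to 0 reproduces the other option mentioned above, i.e. disregarding inflation entirely.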
I need to consider carefully when to use trade values in Dollars and when to use trade volume, and how to merge these units.
4. Modeling
Scale will definitely be an issue here because the number of countries is large, and the mix of extremely large and extremely small countries will most likely produce very different import values. One possible solution is to let users choose a set of countries at a time, e.g. a minimum of 5 and a maximum of 15 countries for the visuals. The scale will be recalculated each time based on the selected countries.
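The recalculation could look roughly like the sketch below, which builds a simple linear scale from the largest value in the current selection. The country codes, import values and the 400-pixel chart height are invented for the example.

```python
MIN_COUNTRIES, MAX_COUNTRIES = 5, 15


def make_scale(import_values, selected, height_px=400):
    """Return a function mapping an import value to a bar height for the selected countries."""
    if not MIN_COUNTRIES <= len(selected) <= MAX_COUNTRIES:
        raise ValueError("select between 5 and 15 countries")
    top = max(import_values[c] for c in selected)

    def scale(value):
        # The largest selected value fills the full height; everything else is proportional.
        return height_px * value / top if top else 0.0

    return scale


# Invented import values in arbitrary units.
values = {"AAA": 900.0, "BBB": 450.0, "CCC": 90.0, "DDD": 30.0, "EEE": 9.0, "FFF": 3.0}

scale = make_scale(values, ["BBB", "CCC", "DDD", "EEE", "FFF"])
print(scale(values["BBB"]), scale(values["FFF"]))
```

Because the maximum is taken over the selection only, dropping the largest country from the selection stretches the remaining bars to use the full height.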
5. Reporting
Rickshaw, RAW, Tableau, etc. are high-level visualisation tools that allow us to plug in data and generate graphs efficiently. However, these tools are not very flexible. D3.js is more flexible, but it has a steep learning curve for beginners. I plan to use D3.js if time permits and if I make good progress with the previous phases. I might switch to RAW if I spend more time on exploratory analysis.